Abstract

In the Bayesian analysis of contingency table data, the selection of a prior distribution for 
either the log-linear parameters or the cell probabilities parameter is a major challenge. Though 
the conjugate prior on cell probabilities has been defined by Dawid and Lauritzen (1993) for 
decomposable graphical models, it has not been identified for the larger class of graphical models 
Markov with respect to an arbitrary undirected graph or for the even wider class of hierarchical 
log-linear models. In this paper, working with the log-linear parameters used by GLIM, we first 
define the conjugate prior for these parameters and then derive the induced prior for the cell 
probabilities: this is done for the general class of hierarchical log-linear models. We show that 
the conjugate prior has all the properties that one expects from a prior: notational simplicity, 
ability to reflect either no prior knowledge or a priori expert knowledge, a moderate number of
hyperparameters and mathematical convenience. It also has the strong hyper Markov property 
which allows for local updates within prime components for graphical models. 
1 Introduction



We consider data given under the form of a contingency table representing the classification of $n$
individuals according to a finite set $V$ of criteria. Each criterion $\gamma \in V$ is represented by a variable
$X_\gamma$ which takes values in a finite set $\mathcal{I}_\gamma$. We observe the values of the variable $X = (X_\gamma, \gamma \in V)$
in $\mathcal{I} = \times_{\gamma \in V} \mathcal{I}_\gamma$ for these $n$ individuals, and we assume that the resulting $|\mathcal{I}|$ cell counts in the
contingency table follow a multinomial distribution. We also assume that the cell probabilities
are modeled according to a hierarchical log-linear model parametrized by interaction parameters
represented by a vector $\theta$. Since the class of discrete graphical models Markov with respect to an
arbitrary undirected graph $G$ is an important subclass of the class of hierarchical log-linear models,
we will give special attention to that class throughout the paper.

In the Bayesian analysis of contingency table data, the selection of a prior distribution for either 
the log-linear parameters or the cell probabilities parameter is a major challenge (see Clyde and 
George, 2004). Priors are usually chosen for their conceptual and computational simplicity and for 
their ability to represent experts' prior beliefs. They are also chosen so that they can conveniently be
used for the whole class of log-linear models which includes nondecomposable as well as decomposable
graphical models. Moreover their parametrization, that is the hyper-parametrization, should be such 
that hyper-parameters are compatible across models. 

As shown in Dawid and Lauritzen (1993), the conjugate prior for decomposable graphical models, 
called the hyper Dirichlet and defined for marginal cliques and separators cell probabilities has all of 
these properties and additionally has the strong hyper Markov property. The latter is very desirable 
since it allows for local updates within cliques thus simplifying the computation of Bayes factors 
in a model selection process. The hyper Dirichlet has therefore been used in many studies (see
for example Madigan and Raftery, 1994 and Dellaportas and Forster, 1999). However, the hyper
Dirichlet is only defined for decomposable graphical models and, when it is used as a prior, the
corresponding posterior probability for a model is only its probability within this restricted class,
thus making it difficult to compare it to the posterior probability of another model considered within
the wider class of hierarchical log-linear models. Moreover, it appears to have many hyperparameters
since a set of parameters has to be chosen for the Dirichlet on each clique and each separator of the 
graph. In fact, all these hyper-parameters are not independent of each other since they have to be 
hyper-consistent but the apparently large number of parameters adds a level of complexity to their 
selection. 

Consequently much effort has been devoted to the study of alternative priors. For example, King
and Brooks (2001), after a discussion on the advantages and disadvantages of the hyper Dirichlet,
propose a multivariate normal prior for the log-linear parameters for all hierarchical log-linear models.
This prior allows for efficient computation, facilitates prior elicitation and induces a log-normal 
distribution on the cell probabilities with easy to compute prior mean and covariances. 

The aim of this paper is to show that the conjugate prior can also be defined for the wider class 
of hierarchical log-linear models in a simple way and that it has all the desirable properties that one 
traditionally wants from a prior. Indeed we will show that experts' prior beliefs or lack of any prior
information can easily be expressed by an appropriate choice of hyperparameters. The chosen prior
is consistent with prior beliefs under both parametrizations of the model. The conjugate prior is also
hyper Markov thus leading to local updates in graphical models, a property that traditional normal 
priors on log-linear parameters do not have. Also, the number of hyperparameters is moderate, in 
fact exactly equal to the number of log-linear parameters plus one and the hyperparameters are 
hyperconsistent across prime components in graphical models and compatible across models. 

In §2, we set our notation and give some preliminary results. We work with the parametrization 
used by GLIM. It is interesting to note that this parametrization expresses the logarithm of cell 
probabilities $p(i), i \in \mathcal{I}$, which we regard as functions of $i$, as the sum of functions $\theta_E(i)$ of $i$ which are
in orthogonal subspaces of the space $\mathbb{R}^{\mathcal{I}}$ of functions on $\mathcal{I}$. This orthogonal decomposition of $\log p(i)$
will ensure that hyperparameters in the prior are compatible across all models. Our parametrization
will also lead us to express the distribution of the marginal cell counts in the contingency table, rather
than the cell counts, as an exponential family. Using this exponential family form, we derive, in §3,
the expression of the Diaconis and Ylvisaker (1979) conjugate prior for the log-linear parameters.
We give a necessary and sufficient condition for this prior to be proper and two methods to obtain
hyperparameters that ensure that the prior is proper. In §4, we obtain the expression of the induced
conjugate prior for the cell probabilities and in §5, we give the details of the properties we mentioned
above. Having the expression of the induced prior on cell probabilities allows us to verify that the
choice of hyperparameters in one parametrization (log-linear or cell probabilities) expresses the same
prior belief in the other parametrization.

2 The log-linear model 

2.1 The parametrization 

Let $V$ be the set of criteria. Let $X = (X_\gamma, \gamma \in V)$ such that $X_\gamma$ takes its values (or levels) in the
finite set $\mathcal{I}_\gamma$ of dimension $|\mathcal{I}_\gamma|$. When a fixed number of individuals are classified according to the
$|V|$ criteria, the data is collected in a contingency table with cells indexed by combinations of levels
for the $|V|$ variables. We adopt the notation of Lauritzen (1996) and denote a cell by

$$ i = (i_\gamma, \gamma \in V) \in \mathcal{I} = \times_{\gamma \in V} \mathcal{I}_\gamma. $$

The count in cell $i$ is denoted $n(i)$ and the probability of an individual falling in cell $i$ is denoted
$p(i)$. For $E \subseteq V$, cells in the $E$-marginal table are denoted

$$ i_E \in \mathcal{I}_E = \times_{\gamma \in E} \mathcal{I}_\gamma. $$

The marginal counts are denoted $n(i_E)$. For $n = \sum_{i \in \mathcal{I}} n(i)$, the vector of cell counts $(n(i), i \in \mathcal{I})$ follows a multinomial
$(n, p(i), i \in \mathcal{I})$ distribution with probability density function

$$ P(n) \propto \prod_{i \in \mathcal{I}} p(i)^{n(i)}. \tag{2.1} $$

Let $i^*$ be a fixed but arbitrary cell which for convenience we take to be the cell indexed by the
"lowest levels" for each factor and, for convenience again, we denote this level by 0. Therefore $i^*$ can
be thought to be the cell

$$ i^* = (0, 0, \ldots, 0). $$

Consider the following parametrization

$$ \theta_E(i) = \sum_{F \subseteq E} (-1)^{|E \setminus F|} \log p(i_F, i^*_{F^c}) \tag{2.2} $$

where by Moebius inversion

$$ p(i) = \exp \sum_{E \subseteq V} \theta_E(i). \tag{2.3} $$

We note that $\theta_\emptyset(i) = \log p(i^*)$, $i \in \mathcal{I}$, and we will therefore adopt the notation

$$ \theta_\emptyset(i) = \theta_\emptyset, \qquad p(i^*) = p_\emptyset = \exp \theta_\emptyset. \tag{2.4} $$

This parametrization has been used in many papers (see for example Dellaportas and Forster, 1999) 
and can be found in Lauritzen (1996, p. 36). 
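
The Moebius pair (2.2) and (2.3) can be checked numerically. The following is a minimal sketch, assuming all variables are binary so that a cell can be identified with the set of variables at level 1 (the empty set playing the role of $i^*$); the helper names and the probabilities are hypothetical and only serve as an illustration.

```python
from itertools import combinations
import math

def subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def theta_from_p(p, V):
    # (2.2), binary case: theta_E = sum over F subset of E of (-1)^{|E \ F|} log p_F
    return {E: sum((-1) ** len(E - F) * math.log(p[F]) for F in subsets(E))
            for E in subsets(V)}

def p_from_theta(theta, V):
    # (2.3), binary case: log p_E = sum over F subset of E of theta_F
    return {E: math.exp(sum(theta[F] for F in subsets(E)))
            for E in subsets(V)}

V = frozenset("ab")
# Hypothetical cell probabilities for two binary variables a and b.
p = {frozenset(): 0.4, frozenset("a"): 0.3,
     frozenset("b"): 0.2, frozenset("ab"): 0.1}
theta = theta_from_p(p, V)      # theta[frozenset()] is log p(i*)
p_back = p_from_theta(theta, V) # Moebius inversion recovers p
```

Here `theta[frozenset("ab")]` is the log odds ratio $\log(p_{ab} p_\emptyset / (p_a p_b))$, which is the usual two-way interaction for a 2 x 2 table.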

Let us make an important remark here. Since i* is fixed, the function 

$$ i \in \mathcal{I} \mapsto \log p(i_E, i^*_{E^c}) $$

belongs to the factor subspace $U_E$ (as defined in Lauritzen, 1996, Appendix B.2) of the space of
real-valued functions on $\mathcal{I}$ that depend only on $i_E$. Therefore by Proposition B.4 of Lauritzen (1996)
and (2.2) above, $\theta_E(i)$ belongs to the interaction subspace $V_E$ which gives the "pure" contribution of
the interaction between variables in $E$ with the interaction between variables in all $C \subset E$ removed.
This means that (2.3) or more precisely its equivalent expression

$$ \log p(i) = \sum_{E \subseteq V} \theta_E(i) $$

is the unique expansion of $\log p(i)$ into its orthogonal components in $V_E, E \subseteq V$. This orthogonal
decomposition of $\log p(i)$ is the property that will make the hyperparametrization of the conjugate
prior on $\theta$, defined below in (2.15), compatible across models since all models will be expressed in
the same orthogonal "basis". Let us now emphasize some other properties of the $\theta$ parametrization
with the following three lemmas.

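Lemma 2.1 For any $i \in \mathcal{I}$ and any $E \subseteq V$, $\theta_E(i)$ depends only on $i_E$, that is

$$ \theta_E(i) = \theta_E(i_E). $$

Since $i^*$ is fixed, the proof of this first lemma is obvious.

Lemma 2.2 If $i$ is such that for some $\gamma \in E$, $i_\gamma = i^*_\gamma = 0$, then $\theta_E(i_E) = 0$.

Proof: By definition and since $(i_{F \cup \gamma}, i^*_{(F \cup \gamma)^c}) = (i_F, i^*_{F^c})$ we have

$$ \theta_E(i_E) = \sum_{F \subseteq E \setminus \gamma} (-1)^{|E \setminus F|} \log p(i_F, i^*_{F^c}) + \sum_{F \subseteq E \setminus \gamma} (-1)^{|E \setminus (F \cup \gamma)|} \log p(i_{F \cup \gamma}, i^*_{(F \cup \gamma)^c}) $$

$$ = \sum_{F \subseteq E \setminus \gamma} (-1)^{|E \setminus F|} \log p(i_F, i^*_{F^c}) - \sum_{F \subseteq E \setminus \gamma} (-1)^{|E \setminus F|} \log p(i_F, i^*_{F^c}) = 0. $$

□

From this lemma, it follows immediately that our parametrization is the GLIM parametrization
that sets to 0 the values of the $E$-interaction log-linear parameters when at least one index in
$E$ is at level 0 (see for example Agresti 1990, p. 150). Therefore, for each $E \subseteq V$, there are only
$\prod_{\gamma \in E} (|\mathcal{I}_\gamma| - 1)$ parameters. The next lemma is actually the Hammersley-Clifford theorem and its
proof can be found for example in Lauritzen (1996, p. 36).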

Lemma 2.3 Assuming all cell probabilities are positive, the distribution of $X = (X_\gamma, \gamma \in V)$ is
Markov with respect to the undirected graph $G$ if and only if $\theta_E(i_E) = 0$ whenever $E \subseteq V$ is not
complete.

From this lemma, it follows that the multinomial distribution of the cell counts is Markov with
respect to a graph $G$ if and only if $\theta_E$ is equal to zero when $E$ is not complete, a well-known property
that we recall here (see Darroch, Lauritzen and Speed, 1980).
For notational convenience, we now define 

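$$ \mathcal{E} = \{E \subseteq V, E \neq \emptyset\}. $$

By Lemma 2.2, for any given $j \in \mathcal{I}$, $\theta_E(j_E) = 0$ if there exists at least one $\gamma \in E$ such that $j_\gamma = 0$.
We therefore define for any $j \in \mathcal{I}$

$$ \mathcal{E}_j = \{E \in \mathcal{E} \mid j_\gamma \neq 0, \forall \gamma \in E\}. \tag{2.5} $$

Then, by (2.3) and Lemma 2.2,

$$ \log p(j) = \theta_\emptyset + \sum_{E \in \mathcal{E}_j} \theta_E(j_E), $$

which yields

$$ p_\emptyset = \frac{1}{1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{E \in \mathcal{E}_j} \theta_E(j_E)} \tag{2.6} $$

and

$$ p(j) = \frac{\exp \sum_{E \in \mathcal{E}_j} \theta_E(j_E)}{1 + \sum_{j' \in \mathcal{I}, j' \neq i^*} \exp \sum_{E \in \mathcal{E}_{j'}} \theta_E(j'_E)} \tag{2.7} $$

and thus all cell probabilities are expressed in terms of the free parameters $\theta_E(i_E)$, $E \in \mathcal{E}$, with $i_\gamma \neq 0$ for all $\gamma \in E$.

2.2 The multinomial distribution for discrete data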

We now want to give the probability density function of the multinomial distribution under the form 
of an exponential family. This will be done successively for the saturated model i.e. Markov with 
respect to a complete graph, for models Markov with respect to an undirected graph G and for 
general hierarchical log-linear models. 
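From (2.1), for the saturated model, we have

$$ P(n) \propto \prod_{i \in \mathcal{I}} \Big( \exp \sum_{E \subseteq V} \theta_E(i_E) \Big)^{n(i)} = \prod_{i \in \mathcal{I}} \exp \Big\{ n(i) \log \exp \sum_{E \subseteq V} \theta_E(i_E) \Big\} = \exp \sum_{i \in \mathcal{I}} n(i) \sum_{E \subseteq V} \theta_E(i_E) $$

$$ = \exp \sum_{E \subseteq V} \sum_{i_E \in \mathcal{I}_E} \theta_E(i_E) \sum_{j \in \mathcal{I}, j_E = i_E} n(j) = \exp \sum_{E \subseteq V} \sum_{i_E \in \mathcal{I}_E} \theta_E(i_E) n(i_E) $$

$$ = \exp \Big\{ \sum_{E \subseteq V, E \neq \emptyset} \sum_{i_E \in \mathcal{I}_E} \theta_E(i_E) n(i_E) + n \theta_\emptyset \Big\}. $$

Moreover, we know from Lemma 2.2 that only those $\theta_E(i_E)$ where $i_\gamma \neq 0, \gamma \in E$ can be nonzero.
Therefore if, for $E \in \mathcal{E}$, we define

$$ \mathcal{I}^*_E = \{ i_E = (i_\gamma, \gamma \in E) \in \mathcal{I}_E, \ i_\gamma \neq 0, \forall \gamma \in E \} \tag{2.8} $$

then the probability density function above becomes

$$ P(n) \propto \exp \Big\{ \sum_{E \in \mathcal{E}} \sum_{i_E \in \mathcal{I}^*_E} \theta_E(i_E) n(i_E) + n \theta_\emptyset \Big\}. \tag{2.9} $$

We see that, with the parametrization that we have chosen, the marginal counts $n(i_E)$, rather
than the cell counts $n(i)$, appear naturally as random variables. Since the Jacobian of

$$ n = (n(i), i \in \mathcal{I}) \mapsto y = (n(i_E), E \in \mathcal{E}, i_E \in \mathcal{I}^*_E) $$

is clearly one, the family of distributions for $y$ is the natural exponential family

$$ \mathcal{F}_{\mu} = \Big\{ f(y; \theta) \mu(y) = \frac{\exp \{ \sum_{E \in \mathcal{E}} \sum_{i_E \in \mathcal{I}^*_E} \theta_E(i_E) n(i_E) \}}{\big( 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{E \in \mathcal{E}_j} \theta_E(j_E) \big)^n} \mu(y), \ \theta \in \mathbb{R}^{\sum_{E \in \mathcal{E}} \prod_{\gamma \in E} (|\mathcal{I}_\gamma| - 1)} \Big\} \tag{2.10} $$

where $\mu$ is a reference measure of no particular interest to us here. This gives us the density for the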
saturated model. 
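
The passage from cell counts to marginal counts in (2.9) can be verified on a small table. The sketch below assumes three binary variables, identifies each cell with the set of variables at level 1, and uses made-up probabilities and counts; it checks that $\sum_i n(i) \log p(i)$ equals $\sum_{E \in \mathcal{E}} \theta_E n_E + n \theta_\emptyset$, where $n_E$ is the marginal count of the cell with all variables of $E$ at level 1.

```python
from itertools import combinations
import math

def subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

V = frozenset("abc")
cells = subsets(V)                      # a cell = set of variables at level 1
weights = [1, 2, 3, 4, 5, 6, 7, 8]      # hypothetical, all positive
p = {c: w / sum(weights) for c, w in zip(cells, weights)}
n = dict(zip(cells, [5, 3, 8, 2, 7, 1, 4, 6]))   # hypothetical cell counts
N = sum(n.values())

# Log-linear parameters from (2.2) in the binary case.
theta = {E: sum((-1) ** len(E - F) * math.log(p[F]) for F in subsets(E))
         for E in cells}
# Marginal count n_E: number of individuals with every variable of E at level 1.
marginal = {E: sum(n[c] for c in cells if E <= c) for E in cells if E}

loglik_cells = sum(n[c] * math.log(p[c]) for c in cells)
loglik_marginals = (sum(theta[E] * marginal[E] for E in marginal)
                    + N * theta[frozenset()])
```

The two log-likelihood values agree up to rounding, which is why the marginal counts can serve as the canonical statistic.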

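When $G$ is an arbitrary undirected graph let

$$ \mathcal{D} = \{ D \in \mathcal{E} \mid D \text{ complete} \} $$

and for any given $j \in \mathcal{I}$

$$ \mathcal{D}_j = \{ D \in \mathcal{D} \mid j_\gamma \neq 0, \forall \gamma \in D \}. \tag{2.11} $$

From Lemma 2.3, to obtain the expression of the cell probabilities and of the family of multinomial
distributions Markov with respect to $G$ it suffices to equate $\theta_E(i)$ to 0 for $E \notin \mathcal{D}$ and for all $i$. We
therefore have

$$ p_\emptyset = \frac{1}{1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D)} \tag{2.12} $$

$$ p(i) = \frac{\exp \sum_{D \in \mathcal{D}_i} \theta_D(i_D)}{1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D)} \tag{2.13} $$

where it is important to note that in (2.13) not all $p(i)$ are free parameters since

$$ \text{for } E \notin \mathcal{D}, \quad \theta_E(i_E) = 0 $$

which implies that for $E \notin \mathcal{D}$, $p(i_E, i^*_{E^c})$ is a function of $p(i_F, i^*_{F^c})$, $F \subset E$, $F \in \mathcal{D}$. Only cell
probabilities of the form $p(i_D, i^*_{D^c})$, $i_D \in \mathcal{I}^*_D$, $D \in \mathcal{D}$ will be free probabilities and form the cell
probability parameter

$$ p = (p(i_D, i^*_{D^c}), \ D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D) \text{ with } p(i_D, i^*_{D^c}) \text{ as in .} \tag{2.14} $$

of the multinomial distribution Markov with respect to $G$ for graphical models, or of the hierarchical
log-linear model. The corresponding log-linear parameters are obviously

$$ \theta = (\theta_D(i_D), \ D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D) \text{ with } \theta_D(i_D) \text{ as in .} \tag{2.15} $$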

Moreover, the family of multinomial distributions Markov with respect to $G$ for

$$ y = (n(i_D), \ D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D) $$

is

$$ \mathcal{F}_{\mu_G} = \Big\{ f_G(y; \theta) \mu_G(y), \ \theta \in \mathbb{R}^{\sum_{D \in \mathcal{D}} \prod_{\gamma \in D} (|\mathcal{I}_\gamma| - 1)} \Big\} \tag{2.16} $$

where $\theta = (\theta_D(i_D), D \in \mathcal{D}, i_D \in \mathcal{I}^*_D)$ and $\mu_G$ is a reference measure of no particular interest to us
here. Densities in $\mathcal{F}_{\mu_G}$ will be written under the natural exponential family form

$$ f_G(y; \theta) = \exp \Big\{ \sum_{D \in \mathcal{D}} \sum_{i_D \in \mathcal{I}^*_D} \theta_D(i_D) n(i_D) - n \log \Big( 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D) \Big) \Big\}. \tag{2.17} $$

When the model is a hierarchical log-linear model, let $\mathcal{D}$ be the set of subsets of $V$ representing
the set of all possible interactions in the given model, which we will call the generating set. Then,
the expression of the cell probabilities and of the multinomial distribution for this model is the same
as in (2.12), (2.13) and (2.16) but with $\mathcal{D}$ representing the generating set for the model.



2.3 The multinomial distribution for binary data 

We consider here the important special case of binary data, because it occurs often in practice and 
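also because in this case, the notation is somewhat simpler. When the variables $X_\gamma, \gamma \in V$ can only
take two values 0 or 1, there is only one cell $i_F$ in each $\mathcal{I}^*_F, F \in \mathcal{E}$ and therefore each cell $i = (i_\gamma, \gamma \in V)$
can be indexed by $E = \{\gamma \in V : i_\gamma = 1\}$, $E \in \mathcal{E} \cup \{\emptyset\}$, that is by the set $E$ such that $i_\gamma \neq 0, \gamma \in E$. The correspondence
between $\mathcal{I}$ and $\mathcal{E} \cup \{\emptyset\}$ is one to one. For $i = (i_\gamma = 1, \gamma \in E, \ i_\gamma = 0, \gamma \notin E)$, we will therefore use the
notation

$$ p_E = p(i) \quad \text{and} \quad \theta_E = \theta_E(i_E). $$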
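The relation (2.2) becomes

$$ \theta_F = \sum_{E \subseteq F} (-1)^{|F \setminus E|} \log p_E = \log \prod_{E \subseteq F} p_E^{(-1)^{|F \setminus E|}}, \quad F \in \mathcal{E} \cup \{\emptyset\}. \tag{2.18} $$

Or equivalently by Moebius inversion

$$ \log p_F = \sum_{E \subseteq F} \theta_E, \quad F \in \mathcal{E}, \quad \text{with } \log p_\emptyset = \theta_\emptyset. \tag{2.19} $$

Then (2.12) and (2.13) become respectively

$$ p_\emptyset = \frac{1}{1 + \sum_{E \in \mathcal{E}} \exp \{ \sum_{F \subseteq E, F \in \mathcal{D}} \theta_F \}} \tag{2.20} $$

$$ p_E = \frac{\exp \{ \sum_{F \subseteq E, F \in \mathcal{D}} \theta_F \}}{1 + \sum_{H \in \mathcal{E}} \exp \{ \sum_{F \subseteq H, F \in \mathcal{D}} \theta_F \}}, \quad E \in \mathcal{E} \tag{2.21} $$

and

$$ \mathcal{F}_{\mu_G} = \Big\{ f(y; \theta) \mu_G(y) = \exp \Big( \sum_{D \in \mathcal{D}} \theta_D y_D - n \log \Big( 1 + \sum_{E \in \mathcal{E}} \exp \Big( \sum_{D \subseteq E, D \in \mathcal{D}} \theta_D \Big) \Big) \Big) \mu_G(y), \ \theta \in \mathbb{R}^{|\mathcal{D}|} \Big\} \tag{2.22} $$

where $\mathcal{D}$ is equal to $\mathcal{E}$, the set of complete subsets of $V$ in $G$ or the generating set for the hierarchical
model for, respectively, the saturated model, the graphical model with respect to $G$ or the hierarchical
model.

We note that (2.14) and (2.15) become

$$ p = (p_D, D \in \mathcal{D}) \quad \text{and} \quad \theta = (\theta_D, D \in \mathcal{D}) $$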

respectively. 

2.4 An example 

We consider the case where $X = (X_a, X_b, X_c, X_d)$ is Markov with respect to the four-cycle with
edges $ab$, $bc$, $cd$, $da$ and where the variables are binary.



We then have 



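$$ \mathcal{D} = \{a, b, c, d, ab, bc, cd, da\} $$

$$ \mathcal{E} = \{a, b, c, d, ab, bc, cd, da, ac, bd, abc, bcd, cda, dab, abcd\} $$

The linear constraints on $\theta_E, E \notin \mathcal{D}$ are

$$ \theta_{ac} = \theta_{bd} = \theta_{abc} = \theta_{bcd} = \theta_{cda} = \theta_{dab} = \theta_{abcd} = 0 $$

and using (2.18), we obtain $p_E, E \notin \mathcal{D}$ in terms of $p = (p_E, E \in \mathcal{D})$ as follows

$$ p_{ac} = \frac{p_a p_c}{p_\emptyset}, \quad p_{bd} = \frac{p_b p_d}{p_\emptyset}, \quad p_{abc} = \frac{p_{ab} p_{bc}}{p_b}, \quad p_{bcd} = \frac{p_{bc} p_{cd}}{p_c}, \quad p_{cda} = \frac{p_{cd} p_{da}}{p_d}, \quad p_{dab} = \frac{p_{da} p_{ab}}{p_a}, $$

$$ p_{abcd} = \frac{p_{ab} p_{bc} p_{cd} p_{da} \, p_\emptyset}{p_a p_b p_c p_d}. $$

The cell probability parameters of the multinomial distribution Markov with respect to the four-cycle
above can be written in terms of $\theta$ as

$$ p_D = p_\emptyset e^{\theta_D}, \quad D \in \{a, b, c, d\}, $$

with

$$ p_{ab} = p_\emptyset e^{\theta_a + \theta_b + \theta_{ab}}, \quad p_{bc} = p_\emptyset e^{\theta_b + \theta_c + \theta_{bc}}, \quad p_{cd} = p_\emptyset e^{\theta_c + \theta_d + \theta_{cd}}, \quad p_{da} = p_\emptyset e^{\theta_d + \theta_a + \theta_{da}}, $$

$$ p_{abc} = \frac{p_{ab} p_{bc}}{p_b}, \quad p_{bcd} = \frac{p_{bc} p_{cd}}{p_c}, \quad p_{cda} = \frac{p_{cd} p_{da}}{p_d}, \quad p_{dab} = \frac{p_{da} p_{ab}}{p_a}, \quad p_{abcd} = p_\emptyset \frac{p_{ab} p_{bc} p_{cd} p_{da}}{p_a p_b p_c p_d}. $$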
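
The four-cycle relations can be confirmed numerically. This is only an illustrative sketch: the interaction values below are arbitrary, cells are again identified with the set of variables at level 1, and the cell probabilities are built from the binary form $\log p_E = \theta_\emptyset + \sum_{D \subseteq E, D \in \mathcal{D}} \theta_D$ with all other interactions set to zero.

```python
from itertools import combinations
import math

def subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

V = frozenset("abcd")
# Hypothetical interaction parameters on the complete subsets of the four-cycle.
theta = {frozenset(k): v for k, v in {
    "a": 0.3, "b": -0.2, "c": 0.5, "d": 0.1,
    "ab": 0.4, "bc": -0.6, "cd": 0.2, "da": -0.3}.items()}

# Unnormalised cell probabilities: exp of the sum of theta_D over D in the model, D subset of E.
unnorm = {E: math.exp(sum(t for D, t in theta.items() if D <= E))
          for E in subsets(V)}
total = sum(unnorm.values())
p = {E: u / total for E, u in unnorm.items()}

def P(label):
    """Cell probability indexed by a string such as 'ac' ('' is the baseline cell)."""
    return p[frozenset(label)]

lhs_rhs = [
    (P("ac"), P("a") * P("c") / P("")),
    (P("abc"), P("ab") * P("bc") / P("b")),
    (P("abcd"), P("ab") * P("bc") * P("cd") * P("da") * P("")
                / (P("a") * P("b") * P("c") * P("d"))),
]
```

Each pair in `lhs_rhs` agrees up to floating point error, so the non-free probabilities are indeed determined by the eight free ones and $p_\emptyset$.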

3 The conjugate prior for the log-linear parameter $\theta$

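From (2.17), it is clear that, for the three nested classes of models considered in this paper, graphical
with respect to $G$ decomposable, graphical with respect to an arbitrary undirected $G$ and hierarchical,
the probability density function for the marginal counts $y$ can be written under an exponential
family form and therefore the form of the conjugate prior for $\theta$ is given immediately (see Diaconis
and Ylvisaker, 1979) by

$$ \pi_G(\theta \mid s, \alpha) = I_G(s, \alpha)^{-1} \exp \Big\{ \sum_{D \in \mathcal{D}} \sum_{i_D \in \mathcal{I}^*_D} \theta_D(i_D) s(i_D) - \alpha \log \Big( 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D) \Big) \Big\} \tag{3.1} $$

where $I_G(s, \alpha)$ is the normalising constant

$$ I_G(s, \alpha) = \int \exp \Big\{ \sum_{D \in \mathcal{D}} \sum_{i_D \in \mathcal{I}^*_D} \theta_D(i_D) s(i_D) - \alpha \log \Big( 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D) \Big) \Big\} \, d\theta \tag{3.2} $$

and where, as usual, $\mathcal{D}$ is equal to $\mathcal{E}$ when the model is saturated, to the set of complete subsets of
$G$ when the model is graphical Markov with respect to $G$ and to the generating set for the model
when the model is hierarchical.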

In order to be able to use this prior in practice, we need to answer a number of questions. The first 
basic question is to know for which values of the hyper parameters $(s, \alpha)$, where

$$ s = (s(i_D), \ D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D) \in \mathbb{R}^{\sum_{D \in \mathcal{D}} \prod_{\gamma \in D} (|\mathcal{I}_\gamma| - 1)}, \quad \alpha \in \mathbb{R}, $$

the distribution is proper, i.e. when does $I_G(s, \alpha) < +\infty$ hold. We will now give a necessary
and sufficient condition for (3.1) to be proper as well as two practical methods to construct hyper
parameters $(s, \alpha)$ such that it is proper. The next set of questions is concerned with the properties of
this prior distribution in practice such as ease of prior specification and the hyper Markov property. These
questions will be addressed in §5. 

3.1 A necessary and sufficient condition for the prior to be proper 

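Lemma 3.1 The prior distribution (3.1) is proper if and only if $(s, \alpha)$ belongs to

$$ \Pi_G = \Big\{ (s, \alpha) \ \Big| \ \alpha > 0, \ \Big( \frac{s(i_D)}{\alpha} = \sum_{j \in \mathcal{I}, j_D = i_D} p(j), \ D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D \Big) \text{ with } p(j) \text{ as in .} \Big\} \tag{3.3} $$

Proof: Since the parameter space of (2.16) is $\Theta_G = \mathbb{R}^{\sum_{D \in \mathcal{D}} \prod_{\gamma \in D} (|\mathcal{I}_\gamma| - 1)}$, by Theorem 1 of Diaconis
and Ylvisaker (1979), a necessary and sufficient condition for (3.2) to be finite is that $\alpha > 0$ and
$\frac{s}{\alpha} = (\frac{s(i_D)}{\alpha}, D \in \mathcal{D}, i_D \in \mathcal{I}^*_D)$ is in the interior of the convex hull of the support of $\mu_G$. Since
the Laplace transform

$$ L_{\mu_G}(\theta) = 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D) $$

is defined for $\Theta_G$ which is an open set, the interior of the convex hull of the support of $\mu_G$ is equal
to the mean space $M_G$ of $\mathcal{F}_{\mu_G}$. We therefore want to identify $M_G$. Let $k_{\mu_G}(\theta) = \log L_{\mu_G}(\theta)$. Since
$\mathcal{F}_{\mu_G}$ is a natural exponential family with parameter $\theta \in \Theta_G$, we have

$$ M_G = \Big\{ m = (m(i_D), D \in \mathcal{D}, i_D \in \mathcal{I}^*_D) \ \Big| \ m(i_D) = E(n(i_D)) = n \frac{\partial k_{\mu_G}(\theta)}{\partial \theta_D(i_D)} = n \sum_{j \in \mathcal{I}, j_D = i_D} p(j) \Big\} \tag{3.4} $$

where $p(j)$ is as in (2.21). It follows immediately that $(s, \alpha) \in \Pi_G$ is a necessary and sufficient
condition for $\pi_G(\theta \mid s, \alpha)$ to be proper. □

From the lemma above, it is clear that in order to belong to $\Pi_G$, $(s, \alpha)$ must satisfy

$$ \alpha > \max_{D \in \mathcal{D}} s_D \quad \text{and} \quad s_D > s_E \text{ for } D \subset E, \ D, E \in \mathcal{D}. $$

However, this condition is not sufficient since $(s, \alpha)$ must also be such that the $p(j)$ in (3.3) satisfy
the conditions $\theta_E(j_E) = 0, E \notin \mathcal{D}$.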

3.2 Two methods to construct $(s, \alpha) \in \Pi_G$

From (3.3), we immediately obtain the following method to construct hyper parameters $(s, \alpha)$ which
are in $\Pi_G$:

1. Choose an arbitrary $\theta = (\theta(i_D), D \in \mathcal{D}, i_D \in \mathcal{I}^*_D)$.

2. Compute $p(i)$ according to (2.13).

3. Compute $s(i_D) = \sum_{j \in \mathcal{I}, j_D = i_D} p(j)$ for $D \in \mathcal{D}, i_D \in \mathcal{I}^*_D$.

4. Take $\alpha = 1$.
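
The four steps above can be sketched as follows for the binary four-cycle of §2.4. The values chosen for $\theta$ are arbitrary and the variable names are hypothetical; with binary data $s(i_D)$ reduces to $s_D$, the probability that every variable of $D$ is at level 1.

```python
from itertools import combinations
import math

def subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

V = frozenset("abcd")
model = [frozenset(k) for k in ("a", "b", "c", "d", "ab", "bc", "cd", "da")]
cells = subsets(V)

# Step 1: choose an arbitrary theta (hypothetical values).
theta = {D: 0.1 * (i - 3) for i, D in enumerate(model)}

# Step 2: compute the cell probabilities of the model from theta.
unnorm = {c: math.exp(sum(theta[D] for D in model if D <= c)) for c in cells}
total = sum(unnorm.values())
p = {c: u / total for c, u in unnorm.items()}

# Step 3: s_D is the D-marginal probability of the cell with all of D at level 1.
s = {D: sum(p[c] for c in cells if D <= c) for D in model}

# Step 4: take alpha = 1.
alpha = 1.0
```

By construction the resulting `(s, alpha)` satisfies the necessary ordering conditions stated after Lemma 3.1, namely `alpha > max(s.values())` and `s[D] > s[E]` whenever `D` is a proper subset of `E`.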

Another practical way to construct $(s, \alpha) \in \Pi_G$ is to start with a "prior contingency table" with
all cell counts $n(i)$ positive. With $n$ denoting the total count in the given contingency table, the
maximum likelihood estimate $\hat{p}$ of $p$ satisfying the equations

$$ n(i_D) = n \sum_{j \in \mathcal{I}, j_D = i_D} \hat{p}(j), \quad D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D $$

and the constraints of the model, exists and therefore we can take

$$ \alpha = n, \quad s(i_D) = n(i_D), \quad D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D $$

thus obtaining hyperparameters in $\Pi_G$.
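
As an illustration of this second construction, the sketch below reads the hyperparameters off a small "prior contingency table". The table, the numbers of levels and the generating set are all made up; the only point is that $s(i_D)$ is the $D$-marginal count at $i_D$ for index vectors $i_D$ with no component at level 0, and that $\alpha$ is the total count.

```python
from itertools import product

levels = {"a": 2, "b": 3}                 # hypothetical numbers of levels
names = sorted(levels)
cells = list(product(*(range(levels[v]) for v in names)))
# Hypothetical positive "prior counts", one per cell: 1, 2, ..., 6.
prior_table = {cell: 1.0 + idx for idx, cell in enumerate(cells)}
generators = [("a",), ("b",), ("a", "b")]  # the set D of the (here saturated) model

alpha = sum(prior_table.values())          # total count
s = {}
for D in generators:
    positions = [names.index(v) for v in D]
    # i_D ranges over I*_D: every component is at a level different from 0.
    for i_D in product(*(range(1, levels[v]) for v in D)):
        s[(D, i_D)] = sum(count for cell, count in prior_table.items()
                          if all(cell[k] == x for k, x in zip(positions, i_D)))
```

With two variables having 2 and 3 levels there are 1 + 2 + 2 = 5 marginal counts in `s`, one fewer than the number of cells, as expected for the saturated model.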

We note here that these hyperparameters are consistent across models since the "marginal counts" 
do not change when we take different models. Marginal counts do not change either when we take 
marginal or conditional models. 

4 The induced prior on the cell probabilities 

In this section, we will give the expression of the induced conjugate prior in terms of p, the cell 
probability parameter, first for graphical models Markov with respect to a decomposable G thus 
making the link between the hyper Dirichlet and our conjugate prior, then for models Markov with 
respect to an arbitrary graph G and finally for general hierarchical models. 
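4.1 The conjugate prior when G is decomposable

For $G$ decomposable with set of cliques $\mathcal{C} = \{C_l, l = 1, \ldots, k\}$ and set of minimal separators
$\mathcal{S} = \{S_l, l = 2, \ldots, k\}$, Dawid and Lauritzen (1993) defined the conjugate prior in terms of cell
probabilities and called it the hyper Dirichlet distribution. Its density is expressed in terms of

$$ p^{C_l}(i_D, i^*_{D^c}), \ D \subseteq C_l, \ l = 1, \ldots, k, \qquad p^{S_l}(i_D, i^*_{D^c}), \ D \subseteq S_l, \ l = 2, \ldots, k, \qquad D \in \mathcal{D}, \ i_D \in \mathcal{I}^*_D, \tag{4.1} $$

the cell probabilities for the clique and separator marginal tables, respectively. Note that in this
subsection, for $D \subseteq C_l$ or $D \subseteq S_l$, $D^c$ denotes the complement of $D$ in $C_l$ or $S_l$ respectively. The
density of the hyper Dirichlet is equal to

$$ \frac{\prod_{l=1}^{k} \mathrm{Dir}_{C_l}\big( p^{C_l}_\emptyset, p^{C_l}(i_D, i^*_{D^c}); \ \alpha^{C_l}_\emptyset, \alpha^{C_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{C_l}, i_D \in \mathcal{I}^*_D \big)}{\prod_{l=2}^{k} \mathrm{Dir}_{S_l}\big( p^{S_l}_\emptyset, p^{S_l}(i_D, i^*_{D^c}); \ \alpha^{S_l}_\emptyset, \alpha^{S_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{S_l}, i_D \in \mathcal{I}^*_D \big)} $$

with

$$ \mathrm{Dir}_{C_l}\big( p^{C_l}_\emptyset, p^{C_l}(i_D, i^*_{D^c}); \ \alpha^{C_l}_\emptyset, \alpha^{C_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{C_l}, i_D \in \mathcal{I}^*_D \big) = \frac{\Gamma\big( \alpha^{C_l}_\emptyset + \sum_{D \in \mathcal{D}^{C_l}} \sum_{i_D \in \mathcal{I}^*_D} \alpha^{C_l}(i_D, i^*_{D^c}) \big)}{\Gamma(\alpha^{C_l}_\emptyset) \prod_{D \in \mathcal{D}^{C_l}} \prod_{i_D \in \mathcal{I}^*_D} \Gamma(\alpha^{C_l}(i_D, i^*_{D^c}))} \ (p^{C_l}_\emptyset)^{\alpha^{C_l}_\emptyset - 1} \prod_{D \in \mathcal{D}^{C_l}} \prod_{i_D \in \mathcal{I}^*_D} p^{C_l}(i_D, i^*_{D^c})^{\alpha^{C_l}(i_D, i^*_{D^c}) - 1} \tag{4.2} $$

with a similar expression for $\mathrm{Dir}_{S_l}$, and where the hyper parameters

$$ (\alpha^{C_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{C_l}, i_D \in \mathcal{I}^*_D) \quad \text{and} \quad (\alpha^{S_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{S_l}, i_D \in \mathcal{I}^*_D) \tag{4.3} $$

are hyperconsistent.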

Since $\pi_G(\theta \mid s, \alpha)$ in (3.1) is the conjugate prior to the multinomial Markov with respect to $G$,
it must coincide with the hyper Dirichlet when G is decomposable. The aim of this subsection is 
to give the correspondence between the parameter (s, a) and the parameters of the hyper Dirichlet 
explicitly. 

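The probabilities in (4.1) are not all free variables since, by the Markov properties of the multinomial
distribution,

$$ p(i) = \frac{\prod_{l=1}^{k} p^{C_l}(i_{C_l})}{\prod_{l=2}^{k} p^{S_l}(i_{S_l})} $$

and therefore some are functions of the others. Let $\mathcal{D}^{C_l}$ and $\mathcal{D}^{S_l}$ denote the set of nonempty subsets
of $C_l$ and $S_l$ respectively. We can choose the free marginal probabilities to be

$$ p^G = \Big( p^{C_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{C_l} \setminus \bigcup_{j=2}^{k} \mathcal{D}^{S_j}, \ l = 1, \ldots, k, \quad p^{S_l}(i_D, i^*_{D^c}), \ D \in \mathcal{D}^{S_l}, \ l = 2, \ldots, k, \quad i_D \in \mathcal{I}^*_D \Big). \tag{4.4} $$

The Jacobian of the change of variable $\theta \mapsto p^G$ is given in the following lemma.

Lemma 4.1 The Jacobian of the change of variables from $\theta = (\theta(i_D), D \in \mathcal{D}, i_D \in \mathcal{I}^*_D)$ as given in
(2.15) to $p^G$ as given in (4.4) is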



dp^ 



(4.5) 



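The proof of this lemma is given in the Appendix. The correspondence between $(s, \alpha)$ and
the hyper parameters of the hyper Dirichlet is given by the following proposition.

Proposition 4.1 When the graph $G$ is decomposable with set of cliques $\{C_l, l = 1, \ldots, k\}$ and set
of minimal separators $\{S_l, l = 2, \ldots, k\}$, the conjugate prior induced from (3.1) is identical to the
hyper Dirichlet (4.2) with hyper parameters (4.3) where

$$ \alpha^{C_l}(i_D, i^*_{D^c}) = \sum_{C_l \supseteq F \supseteq D} \ \sum_{j_F \in \mathcal{I}^*_F \mid (j_F)_D = i_D} (-1)^{|F \setminus D|} s(j_F), \qquad \alpha^{C_l}_\emptyset = \alpha + \sum_{D \subseteq C_l} (-1)^{|D|} \sum_{i_D \in \mathcal{I}^*_D} s(i_D) \tag{4.6} $$

$$ \alpha^{S_l}(i_D, i^*_{D^c}) = \sum_{S_l \supseteq F \supseteq D} \ \sum_{j_F \in \mathcal{I}^*_F \mid (j_F)_D = i_D} (-1)^{|F \setminus D|} s(j_F), \qquad \alpha^{S_l}_\emptyset = \alpha + \sum_{D \subseteq S_l} (-1)^{|D|} \sum_{i_D \in \mathcal{I}^*_D} s(i_D) \tag{4.7} $$

Moreover

$$ I_G(s, \alpha) = \frac{\prod_{l=1}^{k} \Big[ \Gamma(\alpha^{C_l}_\emptyset) \prod_{D \in \mathcal{D}^{C_l}} \prod_{i_D \in \mathcal{I}^*_D} \Gamma(\alpha^{C_l}(i_D, i^*_{D^c})) \Big]}{\Gamma(\alpha) \prod_{l=2}^{k} \Big[ \Gamma(\alpha^{S_l}_\emptyset) \prod_{D \in \mathcal{D}^{S_l}} \prod_{i_D \in \mathcal{I}^*_D} \Gamma(\alpha^{S_l}(i_D, i^*_{D^c})) \Big]}. \tag{4.8} $$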

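Proof: Since the distribution of $Y$ in (2.22) is Markov with respect to $G$, we have that

$$ p(i) = \frac{\prod_{l=1}^{k} p^{C_l}(i_{C_l})}{\prod_{l=2}^{k} p^{S_l}(i_{S_l})}. \tag{4.9} $$

Then

$$ \theta_E(i_E) = \sum_{F \subseteq E} (-1)^{|E \setminus F|} \log p(i_F, i^*_{F^c}) $$

$$ = \sum_{F \subseteq E} (-1)^{|E \setminus F|} \Big( \sum_{l=1}^{k} \log p^{C_l}(i_{F \cap C_l}, i^*_{F^c \cap C_l}) - \sum_{l=2}^{k} \log p^{S_l}(i_{F \cap S_l}, i^*_{F^c \cap S_l}) \Big) $$

$$ = \sum_{l=1}^{k} \theta^{C_l}_E(i_{E \cap C_l}) - \sum_{l=2}^{k} \theta^{S_l}_E(i_{E \cap S_l}). $$

If $E \subseteq C_l$, $E \cap C_l = E$ and $\theta^{C_l}_E(i_{E \cap C_l}) = \theta^{C_l}_E(i_E)$. If $E \not\subseteq C_l$, then by Lemma 2.2 $\theta^{C_l}_E(i_{E \cap C_l}) = 0$ and
similarly for $\theta^{S_l}_E(i_{E \cap S_l})$. We therefore have

$$ \theta_D(i_D) = \sum_{l=1}^{k} \theta^{C_l}_D(i_D) - \sum_{l=2}^{k} \theta^{S_l}_D(i_D), \tag{4.10} $$

where

$$ \theta^{C_l}_D(i_D) = \sum_{F \subseteq D} (-1)^{|D \setminus F|} \log p^{C_l}(i_F, i^*_{F^c}) \ \text{ for } D \subseteq C_l, \qquad \theta^{C_l}_D(i_D) = 0 \ \text{ for } D \not\subseteq C_l, $$

and similar expressions for $\theta^{S_l}_D(i_D)$ (see also Consonni and Leucari, 2005 for the derivation of these
formulas in the case of bivariate data). From (4.9), we also have

$$ \log \Big( 1 + \sum_{j \in \mathcal{I}, j \neq i^*} \exp \sum_{D \in \mathcal{D}_j} \theta_D(j_D) \Big) = -\log p_\emptyset = -\Big( \sum_{l=1}^{k} \log p^{C_l}_\emptyset - \sum_{l=2}^{k} \log p^{S_l}_\emptyset \Big). \tag{4.11} $$

Therefore (3.1) can be written as

$$ \pi_G(\theta(p) \mid s, \alpha) \propto \frac{\prod_{l=1}^{k} \exp \big\{ \sum_{D \in \mathcal{D}^{C_l}} \sum_{i_D \in \mathcal{I}^*_D} \theta^{C_l}_D(i_D) s(i_D) + \alpha \log p^{C_l}_\emptyset \big\}}{\prod_{l=2}^{k} \exp \big\{ \sum_{D \in \mathcal{D}^{S_l}} \sum_{i_D \in \mathcal{I}^*_D} \theta^{S_l}_D(i_D) s(i_D) + \alpha \log p^{S_l}_\emptyset \big\}} $$

$$ = \frac{\prod_{l=1}^{k} \exp \big\{ \sum_{D \in \mathcal{D}^{C_l}} \sum_{i_D \in \mathcal{I}^*_D} \alpha^{C_l}(i_D, i^*_{D^c}) \log p^{C_l}(i_D, i^*_{D^c}) + \alpha^{C_l}_\emptyset \log p^{C_l}_\emptyset \big\}}{\prod_{l=2}^{k} \exp \big\{ \sum_{D \in \mathcal{D}^{S_l}} \sum_{i_D \in \mathcal{I}^*_D} \alpha^{S_l}(i_D, i^*_{D^c}) \log p^{S_l}(i_D, i^*_{D^c}) + \alpha^{S_l}_\emptyset \log p^{S_l}_\emptyset \big\}} \tag{4.12} $$

where $\alpha^{C_l}(i_D, i^*_{D^c})$, $\alpha^{S_l}(i_D, i^*_{D^c})$, $\alpha^{C_l}_\emptyset$ and $\alpha^{S_l}_\emptyset$ are as defined in (4.6) and (4.7).
The induced prior on $p$ is obtained by multiplying (4.12) by the Jacobian (8.2) and it follows
immediately that it is the hyper Dirichlet with hyper parameters as given in (4.6) and (4.7).
The expression of (4.8) is obtained by noticing that for any $C_l$ or $S_l$,

$$ \alpha^{C_l}_\emptyset + \sum_{D \in \mathcal{D}^{C_l}} \sum_{i_D \in \mathcal{I}^*_D} \alpha^{C_l}(i_D, i^*_{D^c}) = \alpha = \alpha^{S_l}_\emptyset + \sum_{D \in \mathcal{D}^{S_l}} \sum_{i_D \in \mathcal{I}^*_D} \alpha^{S_l}(i_D, i^*_{D^c}). $$

This completes the proof. □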



4.2 The conjugate prior when G is arbitrary 

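To obtain the conjugate prior in terms of $p$, we need to compute the Jacobian $\left| \frac{d\theta}{dp} \right|$ of the transformation
from $\theta$ to $p$ as defined respectively in (2.15) and (2.14). Before doing so, we need to define the
following quantities. For $C \in \mathcal{D}$, $H \in \mathcal{E}$, let

$$ F(i_C, j_H) = \begin{cases} (-1)^{|C| - 1} & \text{if } C \subseteq H \text{ and } (j_H)_C = i_C, \\ 0 & \text{otherwise.} \end{cases} \tag{4.13} $$

These $F(i_C, j_H)$ can be gathered in a $\sum_{D \in \mathcal{D}} |\mathcal{I}^*_D| \times \sum_{H \in \mathcal{E}} |\mathcal{I}^*_H|$ matrix $F$ where the rows are indexed
by $i_D \in \mathcal{I}^*_D$, $D \in \mathcal{D}$ and the columns by $j_H \in \mathcal{I}^*_H$, $H \in \mathcal{E}$.

For example, in the case of binary data for $\mathcal{D}$ and $\mathcal{E}$ as given in §2.4 the matrix $F$ is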



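$$ F = \begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & -1 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & -1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & -1 & -1 & -1
\end{pmatrix} $$

with rows in the order $a, b, c, d, ab, bc, cd, da$ of $\mathcal{D}$ and columns in the order
$a, b, c, d, ab, bc, cd, da, ac, bd, abc, bcd, cda, dab, abcd$ of $\mathcal{E}$.
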

We also need the following two lemmas. Their proof is given in the Appendix. 

Lemma 4.2 Let $G$ be a nondecomposable prime graph. For the matrix $F$ as described above,
the sum of the entries in each column $j_H$, $j_H \in \mathcal{I}^*_H$, $H \in \mathcal{E}$ is such that

$$ \sum_{C \in \mathcal{D}} F(i_C, j_H) = 1 \tag{4.15} $$

if and only if $H$, as an induced subgraph of $G$, is decomposable and connected.
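
The column-sum criterion of Lemma 4.2 can be illustrated on the binary four-cycle. The sketch below assumes that the entry $F(i_C, j_H)$ equals $(-1)^{|C|-1}$ when $C \subseteq H$ and 0 otherwise, which is the pattern of the binary example matrix; the function and variable names are hypothetical.

```python
from itertools import combinations

def nonempty_subsets(s):
    """All nonempty subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

V = frozenset("abcd")
# Complete subsets of the four-cycle (rows of F).
model = [frozenset(k) for k in ("a", "b", "c", "d", "ab", "bc", "cd", "da")]
columns = nonempty_subsets(V)            # the 15 columns, one per H

def F(C, H):
    # Assumed entry: (-1)^{|C|-1} when C is contained in H, 0 otherwise.
    return (-1) ** (len(C) - 1) if C <= H else 0

column_sum = {H: sum(F(C, H) for C in model) for H in columns}
# Subsets H whose column sum differs from 1.
U = {"".join(sorted(H)) for H, total in column_sum.items() if total != 1}
```

For this graph `U` contains exactly `ac` and `bd`, which induce disconnected subgraphs, and `abcd`, which induces the nondecomposable four-cycle itself; every other column sums to 1.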






We are now in a position to give the expression of the Jacobian. Let 

$$ \mathcal{U} = \{ F \in \mathcal{E} \mid F \text{ is either nondecomposable or nonconnected} \} $$
and we also write $\mathcal{U}_\emptyset = \mathcal{U} \cup \{\emptyset\}$.
Lemma 4.3 Let 

= ( E F{tc,jH)-l), J^Th, He£u9. (4.16) 
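The Jacobian of the transformation

$$ p = (p(i_D, i^*_{D^c}), \ i_D \in \mathcal{I}^*_D, \ D \in \mathcal{D}) \mapsto \theta = (\theta_D(i_D), \ i_D \in \mathcal{I}^*_D, \ D \in \mathcal{D}), \tag{4.17} $$
where $p$ is as given in (2.14) and $\theta$ as in (2.15), is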



II - (n np(-j.o)(-E 



p(jff,jff<=)~ 

The proof of this lemma is given in the Appendix. 

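We can now give the conjugate prior (3.1) in terms of $p$ as given in (2.14). Let us note first that
by (2.2) and (2.17), the marginal cell counts $y = (n(i_D), i_D \in \mathcal{I}^*_D, D \in \mathcal{D})$ for the multinomial
distribution Markov with respect to $G$ have density

$$ f(y \mid p) \propto \prod_{D \in \mathcal{D}} \prod_{i_D \in \mathcal{I}^*_D} \Big( \prod_{F \subseteq D} p(i_F, i^*_{F^c})^{(-1)^{|D \setminus F|}} \Big)^{n(i_D)} p_\emptyset^{\,n} $$

$$ = \prod_{D \in \mathcal{D}} \prod_{i_D \in \mathcal{I}^*_D} p(i_D, i^*_{D^c})^{\nu(i_D)} \, p_\emptyset^{\,\nu(\emptyset)} \tag{4.19} $$

where $\nu(i_D) = \sum_{F \supseteq D, F \in \mathcal{D}} (-1)^{|F \setminus D|} \sum_{j_F \in \mathcal{I}^*_F \mid (j_F)_D = i_D} n(j_F)$ and $\nu(\emptyset) = n + \sum_{D \in \mathcal{D}} \sum_{i_D \in \mathcal{I}^*_D} (-1)^{|D|} n(i_D)$.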

Theorem 4.1 For (s,a) G IIg a// as given in ^4.16}) , the conjugate prior distribution induced 
from IS. l\l by \4-ll[ , that is, the conjugate prior for the parameter p of the multinomial family of 
distributions ^4-19)) is 



where 



K 



1-;;7E E ^ijH,j*H-)pUH,j*H- 



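$$ \alpha(i_D, i^*_{D^c}) = \sum_{F \supseteq D, F \in \mathcal{D}} \ \sum_{j_F \in \mathcal{I}^*_F \mid (j_F)_D = i_D} (-1)^{|F \setminus D|} s(j_F) \tag{4.21} $$

$$ \alpha_\emptyset = \alpha + \sum_{D \in \mathcal{D}} (-1)^{|D|} \sum_{i_D \in \mathcal{I}^*_D} s(i_D). \tag{4.22} $$
This result follows immediately from the expression of the conjugate prior in terms of $\theta$, (2.2)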
and 

Example 

When the graph is the four cycle with binary data as considered before 

$$ \mathcal{U}_\emptyset = \{ac, bd, abcd, \emptyset\}. $$
From (4.14) and the constraints $\theta_E(i_E) = 0$ for $E \notin \mathcal{D}$, we have

_ _ , _ _, Pac _ PaPc Pbd _ PbPd Pabcd _ PabPbcPcdPda 

P0 P0 P% P0 P0 PaPbPcPd 

and 

lG(^,Ui) Pa Pb Pc Pd Pab Pbc Pcd Pda 

(1 _ P^Pg _ PfcPrf ^ PabPbcPcdPda y I 
pI pI PaPbPcPd 

4.3 The conjugate prior for a general hierarchical model 

When the model is not specified to be graphical but is a hierarchical log-linear model, we can also
obtain the induced prior in terms of $p$ and the statement is similar to Theorem 4.1 above except
that the term coming from the Jacobian $\left| \frac{d\theta}{dp} \right|$ is more general and we have

Theorem 4.2 For (s, a) e lie the conjugate prior distribution induced from \S.l]) by that 
is, the conjugate prior for the parameter p of the multinomial family of distributions i4.19\) for the 
hierarchical log-linear model is as in \4.ZU\j with 

HeSjnei'fj {DCH,Dev} {ccD.cev} 

and a{iD,i*]:,c) and as in and \4-Z^ 

The proof follows immediately from the expression of the conjugate prior (|3.1|) in terms of 6, (|2.2II . 
(|ItH|| and Remark (Igl)) . 

5 Properties of the conjugate prior 

5.1 Hyper-parameter specification 

Let us now turn to the practical problem of choosing hyperparameters which will reflect either some 
prior belief or lack of prior belief. 

Suppose first that we do not have any prior information and want to put a flat prior on the
log-linear parameters. From the expression (2.17) of the distribution of the marginal counts
$y = (n(i_D), D \in \mathcal{D}, i_D \in \mathcal{I}^*_D)$, it is clear that the hyperparameters $s(i_D)$ can be thought of as the "prior
marginal counts" for the marginal cell $i_D$. Therefore, we can take for $s(i_D)$ the set of $i_D$-"marginal
counts", $D \in \mathcal{D}, i_D \in \mathcal{I}^*_D$, for a "prior" contingency table with all "cell counts" equal to $\frac{1}{|\mathcal{I}|}$. We can
also take $\alpha$ to be the "total count", that is 1 in this case. This would lead, of course, to

$$ s(i_D) = \sum_{j \in \mathcal{I}, j_D = i_D} \frac{1}{|\mathcal{I}|} = \frac{\prod_{\gamma \in D^c} |\mathcal{I}_\gamma|}{|\mathcal{I}|}, $$
where $D^c$ is the complement of $D$ in $V$. Since for the saturated model, the conjugate prior in terms
of cell probabilities is the Dirichlet, it is clear that this choice of hyperparameters also yields a flat
prior for the cell probabilities of the saturated model with all hyperparameters being equal to $\frac{1}{|\mathcal{I}|}$.
This prior is in fact the vague prior advocated by Perks (1947) (see also Dellaportas and Forster,
1999).
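
A small numerical check of this flat choice, for three binary variables and the saturated model, is sketched below. It assumes that the Dirichlet hyperparameter of a cell is recovered from the prior marginal counts by the usual inclusion-exclusion sum $\sum_{F \supseteq D} (-1)^{|F \setminus D|} s_F$, with $\alpha + \sum_D (-1)^{|D|} s_D$ for the baseline cell; the names are hypothetical.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s (as frozensets), including the empty set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

V = frozenset("abc")
cells = subsets(V)                       # binary data: a cell = variables at level 1
n_cells = len(cells)                     # |I| = 8
prior_table = {c: 1.0 / n_cells for c in cells}   # all "cell counts" equal to 1/|I|
alpha = 1.0                              # "total count"

# Prior marginal counts s_D for every nonempty D (saturated model).
s = {D: sum(prior_table[c] for c in cells if D <= c) for D in cells if D}

def dirichlet_param(D):
    # Inclusion-exclusion over supersets F of D recovers the count of cell D.
    return sum((-1) ** len(F - D) * s[F] for F in s if D <= F)

alpha_empty = alpha + sum((-1) ** len(D) * s[D] for D in s)
params = {D: dirichlet_param(D) for D in s}
```

Every entry of `params`, as well as `alpha_empty`, equals 1/8, so the induced Dirichlet on the eight cell probabilities has all hyperparameters equal to $1/|\mathcal{I}|$.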

If we have prior information, we can first exclude all the interactions that are thought to be
absent. Indeed, by Lemma 2.3, if two variables are believed to be independent given the others,
then all $\theta_E(i_E) = 0$ for $E \in \mathcal{E}$ containing these two variables. We may have additional information
such as the knowledge of positive or negative interaction between one or more variables. This
knowledge can be expressed by computing the expected value and variance for $e^{\theta_D}$ for appropriate
$D \in \mathcal{D}$. To illustrate what we mean, let us consider the data given by Hook, Albright and Cross
(1980) and studied by King and Brooks (2001). In this data set, there are three variables

a = BC, b = DC, c = MR 

each taking the values 1 or 0 representing the presence or absence of, respectively, birth certificates,
death certificates and medical records for each individual. The individuals under study are children
with spina bifida. The data consists of an incomplete contingency table for each one of six years.
From Hook, Albright and Cross (1980), it can reasonably be assumed that the model is the decomposable
graphical model with cliques $a$ and $bc$. Since the data is binary, from (2.22), the conjugate
prior will then be of the form

$$ \pi_G(\theta \mid s, \alpha) = I_G(s, \alpha)^{-1} \exp \Big\{ \theta_a s_a + \theta_b s_b + \theta_c s_c + \theta_{bc} s_{bc} - \alpha \log \big( 1 + e^{\theta_a} + e^{\theta_b} + e^{\theta_c} + e^{\theta_a + \theta_b} + e^{\theta_a + \theta_c} + e^{\theta_b + \theta_c + \theta_{bc}} + e^{\theta_a + \theta_b + \theta_c + \theta_{bc}} \big) \Big\}. \tag{5.1} $$

There is also some prior knowledge about the interaction between $b$ and $c$, that is for

$$ e^{\theta_{bc}} = \frac{p_{bc} \, p_\emptyset}{p_b \, p_c}. $$

With high probability, $e^{\theta_{bc}}$ is expected to be in the interval $(-.9, -.1)$. From (5.1) and the formulas
given in Proposition 4.1, if we let $s' = (s_a, s_b, s_c, s_{bc} + 1)$ we have



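$$E(e^{\theta_{bc}}) = \frac{I_G(s', \alpha)}{I_G(s, \alpha)}
= \frac{\Gamma(s_{bc}+1)\,\Gamma(s_b - s_{bc} - 1)\,\Gamma(s_c - s_{bc} - 1)\,\Gamma(\alpha - s_b - s_c + s_{bc} + 1)}{\Gamma(s_{bc})\,\Gamma(s_b - s_{bc})\,\Gamma(s_c - s_{bc})\,\Gamma(\alpha - s_b - s_c + s_{bc})}
= \frac{s_{bc}(\alpha - s_b - s_c + s_{bc})}{(s_b - s_{bc} - 1)(s_c - s_{bc} - 1)}.$$

We therefore have the constraint

$$-.9 \le \frac{s_{bc}(\alpha - s_b - s_c + s_{bc})}{(s_b - s_{bc} - 1)(s_c - s_{bc} - 1)} \le -.1. \qquad (5.2)$$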

In the absence of any prior knowledge on the other log-linear parameters, we can assume that
their expectation is around 0, which would imply that

$$E(e^{\theta_a}) = \frac{\Gamma(\alpha - s_a - 1)\,\Gamma(s_a + 1)}{\Gamma(\alpha - s_a)\,\Gamma(s_a)} = \frac{s_a}{\alpha - s_a - 1},$$

$$E(e^{\theta_b}) = \frac{\Gamma(\alpha - s_b - 1 - s_c + s_{bc})\,\Gamma(s_b + 1 - s_{bc})}{\Gamma(\alpha - s_b - s_c + s_{bc})\,\Gamma(s_b - s_{bc})} = \frac{s_b - s_{bc}}{\alpha - s_b - s_c + s_{bc} - 1},$$

$$E(e^{\theta_c}) = \frac{\Gamma(\alpha - s_b - 1 - s_c + s_{bc})\,\Gamma(s_c + 1 - s_{bc})}{\Gamma(\alpha - s_b - s_c + s_{bc})\,\Gamma(s_c - s_{bc})} = \frac{s_c - s_{bc}}{\alpha - s_b - s_c + s_{bc} - 1}$$

are all around 1. If we took all three ratios to be 1, we would obtain the relationships

$$2s_a = \alpha - 1, \quad s_b - s_{bc} = s_c - s_{bc}, \quad 2(s_b - s_{bc}) = \alpha - s_c - 1, \quad 2(s_c - s_{bc}) = \alpha - s_b - 1,$$

$$-.9 < \frac{s_{bc}(s_b - s_{bc} + 1)}{(s_b - s_{bc} - 1)^2} < -.1, \qquad (5.3)$$

and choose appropriate $(s, \alpha)$ satisfying these conditions. We might also want to compute the
variance of these quantities which is, of course, also immediate with the results of Proposition 4.1,
and give an interval where we wish $E(e^{\theta_D})$, $D \in \{a, b, c\}$, to be.
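
The moment formulas above are ratios of normalising constants, so they are easy to evaluate numerically. The following is a minimal sketch under stated assumptions: it takes $I_G(s,\alpha)$ for the model with cliques a and bc to be the product of the Dirichlet normalising constants of the two clique marginals (the factor depending on $\alpha$ alone cancels in every ratio), and the hyperparameter values are hypothetical numbers chosen only to satisfy the four equalities obtained by setting the three ratios to 1. The function names are illustrative.

```python
from math import lgamma, exp

def log_norm_const(s_a, s_b, s_c, s_bc, alpha):
    """Log of I_G(s, alpha) for the binary model with cliques {a} and {b, c}.

    Assumption: product of the Dirichlet normalising constants of the two
    clique marginals; the alpha-only factor cancels in the ratios used below.
    """
    cells = [
        s_a, alpha - s_a,               # clique {a}: a = 1, a = 0
        s_bc, s_b - s_bc, s_c - s_bc,   # clique {b, c}: (1,1), (1,0), (0,1)
        alpha - s_b - s_c + s_bc,       # clique {b, c}: (0,0)
    ]
    if min(cells) <= 0:
        raise ValueError("all prior cell counts must be positive")
    return sum(lgamma(x) for x in cells) - 2.0 * lgamma(alpha)

def expected_exp_theta(s, alpha, shift):
    """E(exp(theta_D)) computed as I_G(s', alpha) / I_G(s, alpha), s' = s + shift."""
    s_shifted = tuple(x + d for x, d in zip(s, shift))
    return exp(log_norm_const(*s_shifted, alpha) - log_norm_const(*s, alpha))

# Hypothetical hyperparameters (s_a, s_b, s_c, s_bc) and alpha
s = (5.5, 5.0, 5.0, 2.0)
alpha = 12.0

e_a = expected_exp_theta(s, alpha, (1, 0, 0, 0))
e_b = expected_exp_theta(s, alpha, (0, 1, 0, 0))
e_c = expected_exp_theta(s, alpha, (0, 0, 1, 0))
e_bc = expected_exp_theta(s, alpha, (0, 0, 0, 1))

# Closed form of the b-c interaction moment, for comparison
s_a, s_b, s_c, s_bc = s
closed_bc = s_bc * (alpha - s_b - s_c + s_bc) / ((s_b - s_bc - 1) * (s_c - s_bc - 1))
print(e_a, e_b, e_c, e_bc, closed_bc)
```

Working on the log scale with `lgamma` avoids overflow when the prior counts are large; the same two functions can be reused to scan candidate values of $(s, \alpha)$ against whatever interval is chosen for the interaction.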

In general, when the model considered is not necessarily a decomposable graphical model, the
ratio of normalising constants of the type $I_G(s', \alpha)/I_G(s, \alpha)$ has to be computed numerically. This is feasible
by any of the standard MCMC or approximation methods. However, it might be wiser and much
simpler to choose a decomposable model covering the interaction believed to be true. For example,
if, in the example above, the prior model was believed to be the hierarchical model with generating
class {ab, bc, ca}, then a reasonable prior model would be the saturated model Markov with respect
to the complete graph subject to the fact that the interaction between a, b and c is weak, that is
$E(e^{\theta_{abc}})$ is close to 1.

It remains to know whether the hyperparameters chosen for the conjugate prior on the
log-linear parameter $\theta$ will yield hyperparameters in the induced conjugate prior for the cell
probabilities which are consistent with the given prior beliefs. From Theorem 4.1 we know that
the induced prior for the cell probabilities "looks" like a Dirichlet on the free cell probabilities,
that is $p(i_D, i^*_{D^c})$, $D \in \mathcal{D}$, $i_D \in \mathcal{I}^*_D$, with an additional factor for the Jacobian. The powers of the
$p(i_D, i^*_{D^c})$ correspond to "prior cell counts" $n(i_D)$ and therefore any choice $s(i_D)$ in (3.1) will have
the same meaning in (4.20). For example, to the condition that (5.2) be in the interval
$(-.9, -.1)$ corresponds the condition that

$$\frac{p_{bc}\, p_\emptyset}{p_b\, p_c}$$

be in that interval also. From (4.20), the conjugate prior on $p = (p_a, p_b, p_c, p_{bc})$ is



1 



PaPb+PaPc+PaPbc 





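Therefore

$$E\Big(\frac{p_{bc}\, p_\emptyset}{p_b\, p_c}\Big) = \frac{I_G(s', \alpha)}{I_G(s, \alpha)}$$

where $s' = (s_a, s_b - 1, s_c - 1, s_{bc} + 1)$ and it follows immediately that

$$E\Big(\frac{p_{bc}\, p_\emptyset}{p_b\, p_c}\Big) = \frac{s_{bc}(\alpha - s_b - s_c + s_{bc})}{(s_b - s_{bc} - 1)(s_c - s_{bc} - 1)}$$

thus giving the same condition as in (5.2).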



5.2 The strong hyper Markov property for local updates in graphical
models

Let us now assume that the multinomial distribution of the contingency cell counts is Markov with
respect to an arbitrary undirected graph G. We know from Dawid and Lauritzen (1993) that the
multinomial distribution is strong meta Markov and, as the conjugate distribution of the parameter
$\theta$ of the exponential family (2.22), the conjugate prior (3.1) is strong hyper Markov.

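Consider the decomposition of G into its prime components and let $P_l$, $l = 1, \ldots, k$, be a perfect
enumeration of these components. Let $S_l$, $l = 2, \ldots, k$, be the corresponding separators. We now
give the expression of (3.1) as the Markov ratio of conjugate priors on the prime components and
the separators of G.

Proposition 5.1 The conjugate prior (3.1) can be written as the Markov ratio

$$\pi_G(\theta \mid s, \alpha) = \frac{\prod_{l=1}^{k} \pi_{P_l}(\theta^{P_l} \mid s^{P_l}, \alpha)}{\prod_{l=2}^{k} \pi_{S_l}(\theta^{S_l} \mid s^{S_l}, \alpha)} \qquad (5.4)$$

where

$$\pi_{P_l}(\theta^{P_l} \mid s^{P_l}, \alpha) \qquad (5.5)$$

is the conjugate prior of the form (3.1) on the prime component $P_l$, $\pi_{S_l}(\theta^{S_l} \mid s^{S_l}, \alpha)$ is defined
similarly on the separator $S_l$, and where $s^{P_l} = (s(i_D), D \in \mathcal{D}^{P_l}, i_D \in \mathcal{I}^*_D)$ and
$s^{S_l} = (s(i_D), D \in \mathcal{D}^{S_l}, i_D \in \mathcal{I}^*_D)$.

The induced conjugate prior (4.20) on p can be written as the corresponding Markov ratio of conjugate
priors induced on $p^{P_l}$ and $p^{S_l}$ from $\pi_{P_l}(\theta^{P_l} \mid s^{P_l}, \alpha)$ in (5.5) and $\pi_{S_l}(\theta^{S_l} \mid s^{S_l}, \alpha)$.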

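Proof: It is not difficult to see that

$$\theta_D(i_D) = \sum_{l=1}^{k} \theta^{P_l}_D(i_D) - \sum_{l=2}^{k} \theta^{S_l}_D(i_D), \quad D \in \mathcal{D}, \qquad (5.6)$$

where

$$\theta^{P_l}_D(i_D) = \sum_{F \subseteq D} (-1)^{|D \setminus F|} \log p^{P_l}(i_F, i^*_{F^c}) \quad \text{for } D \subseteq P_l,$$

with $\theta^{S_l}_D(i_D)$ defined in the same way from $p^{S_l}$ for $D \subseteq S_l$, and

$$\theta^{P_l}_D(i_D) = 0 \ \text{for } D \not\subseteq P_l,\ l = 1, \ldots, k, \qquad \theta^{S_l}_D(i_D) = 0 \ \text{for } D \not\subseteq S_l,\ l = 2, \ldots, k,\ i_D \in \mathcal{I}^*_D.$$

We have proved this property for a decomposable graph G in §4.1, (4.10). The proof goes exactly
along the same lines here. Therefore, if we let

$$\mathcal{D}^{P_l} = \{D \in \mathcal{D} \mid D \subseteq P_l\}, \qquad \mathcal{D}^{S_l} = \{D \in \mathcal{D} \mid D \subseteq S_l\},$$

$$\theta^{P_l} = (\theta^{P_l}(i_D),\ D \in \mathcal{D}^{P_l},\ i_D \in \mathcal{I}^*_D), \qquad \theta^{S_l} = (\theta^{S_l}(i_D),\ D \in \mathcal{D}^{S_l},\ i_D \in \mathcal{I}^*_D),$$

we see that

$$\sum_{D \in \mathcal{D}} \sum_{i_D \in \mathcal{I}^*_D} \theta_D(i_D)\, s(i_D) = \sum_{l=1}^{k} \sum_{D \in \mathcal{D}^{P_l}} \sum_{i_D \in \mathcal{I}^*_D} \theta^{P_l}(i_D)\, s(i_D) - \sum_{l=2}^{k} \sum_{D \in \mathcal{D}^{S_l}} \sum_{i_D \in \mathcal{I}^*_D} \theta^{S_l}(i_D)\, s(i_D).$$

Since by the Markov property we also have $p_\emptyset = \prod_{l=1}^{k} p^{P_l}_\emptyset \big/ \prod_{l=2}^{k} p^{S_l}_\emptyset$, that is

$$\log \frac{1}{p_\emptyset} = \sum_{l=1}^{k} \log \frac{1}{p^{P_l}_\emptyset} - \sum_{l=2}^{k} \log \frac{1}{p^{S_l}_\emptyset},$$

where $1/p_\emptyset = 1 + \sum_{i \in \mathcal{I},\, i \neq i^*} p(i)/p_\emptyset$ is the argument of the logarithm multiplying $\alpha$ in (3.1), and
similarly for each prime component and each separator, it follows immediately that (5.4) is verified.

We note that, as the restriction of $s$ to $(D \in \mathcal{D}^{P_l}, i_D \in \mathcal{I}^*_D)$, the coefficients of
$(\theta^{P_l}(i_D), D \in \mathcal{D}^{P_l}, i_D \in \mathcal{I}^*_D)$ in (5.5) are consistent across prime components and separators.
The factorization of the induced conjugate prior on p can be proved in a similar fashion. □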

For given data y with total count n, the posterior distribution of $\theta$ given y will be

$$\pi_G(\theta \mid s + y, \alpha + n). \qquad (5.7)$$

When comparing two models $G$ and $G'$, the Bayes factor is the ratio of quantities of the type

$$\frac{I_G(s, \alpha)}{I_{G'}(s, \alpha)}.$$

In the restricted class of decomposable graphical models, it is well-known that one can go from one
decomposable graph to another through a succession of graphs that differ by only one edge. The
additional edge can only belong to one clique in the new graph and as a consequence the Bayes
factor affects only the graph induced by two cliques (see (37) in Dawid and Lauritzen, 1993). We
are not aware of any such rule in the case of nondecomposable models. However, it is clear that the
Bayes ratio will only involve the computation of the normalising constants for the subgraph induced
by the prime components $P_l$ affected by the additional edge.



6 Conclusion 

In this paper we have studied the conjugate prior for the log-linear parameters of discrete hierarchical
log-linear models and its induced prior on the cell probability parameter p, thus extending the hyper
Dirichlet, which was the only form of the conjugate prior identified so far.

This prior has all the properties that one usually requires. As we have shown, it has a moderate
number of hyperparameters, precisely as many as there are log-linear parameters plus one. These
hyperparameters are consistent across models. It is not difficult to translate prior knowledge into
constraints for the hyperparameters, and constraints both in terms of the log-linear parameters and
cell probabilities are consistent with prior beliefs, as illustrated in §5.1.

This prior has the additional property of being strong hyper Markov, thus leading to local updates
for the computation of Bayes factors, and it is also, of course, mathematically convenient since the
prior and the posterior have the same form as the likelihood. The conjugate prior should therefore
be one of the priors used for the study of contingency tables with a multinomial distribution for the
cell counts. Though we have not mentioned it above, the translation of our results to the case of
Poisson sampling is immediate.

8 Appendix

8.1 Proof of Lemma 4.1

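We will first give the proof in the case where G is the simple decomposable graph a - b - c and
the data is binary. We will then sketch the proof for the general case of an arbitrary decomposable
graph and discrete data.

In the particular case of binary data, $p^G$ in Lemma 4.1 becomes

$$p^G = \big(p^{C_l}_D,\ D \in \mathcal{D}^{C_l} \setminus (\cup_{j=2}^{k} \mathcal{D}^{S_j}),\ l = 1, \ldots, k, \quad p^{S_l}_D,\ D \in \mathcal{D}^{S_l},\ l = 2, \ldots, k\big) \qquad (8.1)$$

and the Jacobian of the change of variables from $\theta = (\theta_D, D \in \mathcal{D})$ as given in (2.18) to $p^G$ as given
in (8.1) is

$$\Big|\frac{d\theta}{dp^G}\Big|^{-1} = \frac{\prod_{l=1}^{k} \prod_{D \subseteq C_l} p^{C_l}_D}{\prod_{l=2}^{k} \prod_{D \subseteq S_l} p^{S_l}_D}, \qquad (8.2)$$

where the products run over all subsets $D$ of $C_l$ and $S_l$ respectively, the empty set included.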

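We are therefore going to first prove (8.2) for the two-chain graph above. In this case we have
$C_1 = \{a, b\}$, $C_2 = \{b, c\}$, $S = \{b\}$, $p^G = (p^{C_1}_a, p^{C_1}_{ab}, p^{C_2}_c, p^{C_2}_{bc}, p^{S}_b)$ and

$$e^{\theta_a} = \frac{p_a}{p_\emptyset}, \quad e^{\theta_b} = \frac{p_b}{p_\emptyset}, \quad e^{\theta_c} = \frac{p_c}{p_\emptyset}, \quad e^{\theta_a + \theta_{ab}} = \frac{p_{ab}}{p_b}, \quad e^{\theta_{bc} + \theta_c} = \frac{p_{bc}}{p_b}.$$

Moreover, since the multinomial distribution is Markov with respect to the graph G, we have

$$p_{abc} = \frac{p_{ab}\, p_{bc}}{p_b} \quad \text{and} \quad p_{ac} = \frac{p_a\, p_c}{p_\emptyset}.$$

Therefore

$$\frac{p^{C_1}_a}{p^{C_1}_\emptyset} = \frac{p_a + p_{ac}}{p_\emptyset + p_c} = \frac{p_a (1 + p_c/p_\emptyset)}{p_\emptyset (1 + p_c/p_\emptyset)} = \frac{p_a}{p_\emptyset} \qquad (8.3)$$

$$\frac{p^{C_2}_c}{p^{C_2}_\emptyset} = \frac{p_c}{p_\emptyset} \qquad (8.4)$$

$$\frac{p^{C_1}_{ab}}{p^{C_1}_b} = \frac{p_{ab} + p_{abc}}{p_b + p_{bc}} = \frac{p_{ab} (1 + p_{bc}/p_b)}{p_b (1 + p_{bc}/p_b)} = \frac{p_{ab}}{p_b} \qquad (8.5)$$

$$\frac{p^{C_2}_{bc}}{p^{C_2}_b} = \frac{p_{bc} + p_{abc}}{p_b + p_{ab}} = \frac{p_{bc} (1 + p_{ab}/p_b)}{p_b (1 + p_{ab}/p_b)} = \frac{p_{bc}}{p_b} \qquad (8.6)$$

$$\frac{p^{S}_b}{p^{S}_\emptyset} = \frac{p_b + p_{ab} + p_{bc} + p_{abc}}{p_\emptyset + p_a + p_c + p_{ac}} = \frac{p_b (1 + p_{ab}/p_b)(1 + p_{bc}/p_b)}{p_\emptyset (1 + p_a/p_\emptyset)(1 + p_c/p_\emptyset)}. \qquad (8.7)$$

We introduce the intermediate variables

$$v_a = \frac{p^{C_1}_a}{p^{C_1}_\emptyset}, \quad v_{ab} = \frac{p^{C_1}_{ab}}{p^{C_1}_b}, \quad v_c = \frac{p^{C_2}_c}{p^{C_2}_\emptyset}, \quad v_{bc} = \frac{p^{C_2}_{bc}}{p^{C_2}_b}, \quad v_b = \frac{p^{S}_b}{p^{S}_\emptyset}.$$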

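From (8.3) to (8.7), we have

$$e^{\theta_a} = v_a, \quad e^{\theta_c} = v_c, \quad e^{\theta_a + \theta_{ab}} = v_{ab}, \quad e^{\theta_{bc} + \theta_c} = v_{bc} \quad \text{and} \quad e^{\theta_b} = v_b\, \frac{(1 + v_a)(1 + v_c)}{(1 + v_{ab})(1 + v_{bc})}.$$

It is then immediate to see that

$$\Big|\frac{d\theta}{dv}\Big| = \prod_{D \in \mathcal{D}} v_D^{-1}. \qquad (8.8)$$

Moreover, since $p^{C_1}_b + p^{C_1}_{ab} = p^{S}_b = p^{C_2}_b + p^{C_2}_{bc}$, then
$p^{C_1}_\emptyset = 1 - p^{C_1}_a - p^{C_1}_b - p^{C_1}_{ab} = 1 - p^{C_1}_a - p^{S}_b$ and
similarly $p^{C_2}_\emptyset = 1 - p^{C_2}_c - p^{S}_b$. Therefore

$$v_a = \frac{p^{C_1}_a}{1 - p^{C_1}_a - p^{S}_b}, \quad v_c = \frac{p^{C_2}_c}{1 - p^{C_2}_c - p^{S}_b}, \quad v_{ab} = \frac{p^{C_1}_{ab}}{p^{S}_b - p^{C_1}_{ab}}, \quad v_{bc} = \frac{p^{C_2}_{bc}}{p^{S}_b - p^{C_2}_{bc}}, \quad v_b = \frac{p^{S}_b}{1 - p^{S}_b} \qquad (8.9)$$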

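and the matrix of the Jacobian $dv/dp^G$ is, with both $v$ and $p^G$ ordered as $(a, ab, c, bc, b)$,

$$\frac{dv}{dp^G} = \begin{pmatrix}
\frac{1 - p^{S}_b}{(1 - p^{C_1}_a - p^{S}_b)^2} & 0 & 0 & 0 & * \\
0 & \frac{p^{S}_b}{(p^{S}_b - p^{C_1}_{ab})^2} & 0 & 0 & * \\
0 & 0 & \frac{1 - p^{S}_b}{(1 - p^{C_2}_c - p^{S}_b)^2} & 0 & * \\
0 & 0 & 0 & \frac{p^{S}_b}{(p^{S}_b - p^{C_2}_{bc})^2} & * \\
0 & 0 & 0 & 0 & \frac{1}{(1 - p^{S}_b)^2}
\end{pmatrix} \qquad (8.10)$$

where row $D$ contains the partial derivatives of $v_D$ and $*$ denotes an entry that does not enter the determinant.
The Jacobian is equal to the product of the diagonal elements and since $1 - p^{S}_b = p^{S}_\emptyset$,

$$\Big|\frac{dv}{dp^G}\Big| = \frac{(p^{S}_\emptyset)^2 (p^{S}_b)^2}{(p^{S}_\emptyset)^2 (p^{C_1}_\emptyset)^2 (p^{C_1}_b)^2 (p^{C_2}_\emptyset)^2 (p^{C_2}_b)^2} = \frac{(p^{S}_b)^2}{(p^{C_1}_\emptyset)^2 (p^{C_1}_b)^2 (p^{C_2}_\emptyset)^2 (p^{C_2}_b)^2}. \qquad (8.11)$$

Therefore

$$\Big|\frac{d\theta}{dp^G}\Big| = \frac{p^{S}_\emptyset\, p^{S}_b}{p^{C_1}_\emptyset\, p^{C_1}_a\, p^{C_1}_b\, p^{C_1}_{ab}\, p^{C_2}_\emptyset\, p^{C_2}_b\, p^{C_2}_c\, p^{C_2}_{bc}} \qquad (8.12)$$

which proves the lemma for the simple two-link chain graph considered.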
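
The Jacobian for the chain a - b - c can be checked numerically. The sketch below is an illustration only: it assumes binary data, writes $\theta$ as a function of the five clique and separator marginals through the ratios $p^{C_1}_a/p^{C_1}_\emptyset$, $p^{C_1}_{ab}/p^{C_1}_b$, $p^{C_2}_c/p^{C_2}_\emptyset$, $p^{C_2}_{bc}/p^{C_2}_b$ and $p^{S}_b/p^{S}_\emptyset$, and compares a finite-difference determinant with the product formula (numerator $p^S_\emptyset p^S_b$, denominator the eight clique marginals). The function names and the evaluation point are hypothetical.

```python
from math import log

def theta_from_pG(x):
    """theta = (theta_a, theta_b, theta_c, theta_ab, theta_bc) from the marginals
    x = (p^{C1}_a, p^{C1}_{ab}, p^{C2}_c, p^{C2}_{bc}, p^S_b)."""
    pa, pab, pc, pbc, pb = x
    va = pa / (1 - pa - pb)      # p^{C1}_a / p^{C1}_empty
    vab = pab / (pb - pab)       # p^{C1}_{ab} / p^{C1}_b
    vc = pc / (1 - pc - pb)      # p^{C2}_c / p^{C2}_empty
    vbc = pbc / (pb - pbc)       # p^{C2}_{bc} / p^{C2}_b
    vb = pb / (1 - pb)           # p^S_b / p^S_empty
    th_a, th_c = log(va), log(vc)
    th_ab, th_bc = log(vab) - th_a, log(vbc) - th_c
    th_b = log(vb) + log(1 + va) + log(1 + vc) - log(1 + vab) - log(1 + vbc)
    return [th_a, th_b, th_c, th_ab, th_bc]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            d = -d
        d *= a[col][col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
    return d

def numeric_jacobian(f, x, h=1e-6):
    """Central finite differences; entry [i][j] approximates d f_i / d x_j."""
    n = len(x)
    cols = []
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        cols.append([(u - v) / (2 * h) for u, v in zip(f(xp), f(xm))])
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def closed_form(x):
    """p^S_empty p^S_b divided by the product of the eight clique marginals."""
    pa, pab, pc, pbc, pb = x
    c1 = pa * pab * (pb - pab) * (1 - pa - pb)
    c2 = pc * pbc * (pb - pbc) * (1 - pc - pb)
    return pb * (1 - pb) / (c1 * c2)

x = [0.2, 0.15, 0.25, 0.1, 0.4]   # hypothetical marginals, all cells positive
numeric = abs(det(numeric_jacobian(theta_from_pG, x)))
exact = closed_form(x)
print(numeric, exact)
```

The determinant is computed in plain Python so that the sketch has no dependency beyond the standard library; the two printed values should agree up to finite-difference error.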

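For a general decomposable graph with binary data, if we write $S = \cup_{l=2}^{k} S_l$, then the intermediate
variables will be

$$v = \Big(\frac{p^{C_l}_D}{p^{C_l}_{D \cap S}},\ D \in \mathcal{D}^{C_l} \setminus (\cup_{j=2}^{k} \mathcal{D}^{S_j}),\ l = 1, \ldots, k, \ \ldots\Big)$$

and the proof will follow the same lines as above.

In the case of discrete data, the proof follows the same lines as the proof above with the following
substitutions. For $D \in \mathcal{D}$,

$\theta_D$ becomes $(\theta_D(i_D), i_D \in \mathcal{I}^*_D)$
$p_D$ becomes $(p(i_D), i_D \in \mathcal{I}^*_D)$
$p^{C_l}_D$ and $p^{S_l}_D$ become $p^{C_l}(i_D)$ and $p^{S_l}(i_D)$ respectively, $i_D \in \mathcal{I}^*_D$
$v_D$ becomes $(v_D(i_D), i_D \in \mathcal{I}^*_D)$
$p^{C_l}_\emptyset = 1 - \sum_{D \in \mathcal{D}^{C_l}} p^{C_l}_D$ becomes $p^{C_l}_\emptyset = 1 - \sum_{D \in \mathcal{D}^{C_l}} \sum_{i_D \in \mathcal{I}^*_D} p^{C_l}(i_D)$ and similarly for $p^{S_l}_\emptyset$.

8.2 Proof of Lemma 4.2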



For ease of notation, we will give the proof of the lemma in the case of binary data. Since for each
$C \in \mathcal{D}$ and $H \in \mathcal{E}$ there is only one cell in $\mathcal{I}^*_C$ and $\mathcal{I}^*_H$ respectively, we will adopt the notation

$$F_{C,H} = F(i_C, j_H), \quad C \in \mathcal{D},\ H \in \mathcal{E}.$$

Let us first prove that if H is decomposable, then (4.15) is true. We proceed by induction on the
number k of cliques of H. Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ be a perfect ordering of the cliques of H.

If H is complete, that is k = 1, we consider two cases, the case where |H| is even and the case
where it is odd. For $|H| = 2p$, $p \in \mathbb{N}$, there are $n_e = \sum_{k=1}^{p} \binom{2p}{2k}$ nonempty subsets of H of even
cardinality and $n_o = \sum_{k=1}^{p} \binom{2p}{2k-1}$ subsets of odd cardinality. Therefore

$$\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H} = \sum_{k=1}^{p} \binom{2p}{2k-1} - \sum_{k=1}^{p} \binom{2p}{2k} = 1$$

and (4.15) is verified. We omit the proof for the case $|H| = 2p - 1$ which is parallel to that of the
previous case. Therefore (4.15) is verified for k = 1.

Let us now assume that H is decomposable but not complete, that is k > 1, and let us assume
that (4.15) is true for any decomposable subset with k - 1 cliques. It is well-known from the theory
of decomposable graphs that, if we write $H_{k-1} = \cup_{j=1}^{k-1} C_j$, then $H = H_{k-1} \cup (C_k \setminus S_k)$ where
$S_k = H_{k-1} \cap C_k$ is the k-th minimal separator in H. Therefore we have

$$\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H} = \sum_{C \subseteq_{\mathcal{D}} H_{k-1}} F_{C,H} + \Big(\sum_{C \subseteq_{\mathcal{D}} C_k} F_{C,H} - \sum_{C \subseteq_{\mathcal{D}} S_k} F_{C,H}\Big). \qquad (8.13)$$

The first term on the right hand side of (8.13) is equal to 1 by our induction assumption while each
one of the two other terms is also equal to 1 because both $C_k$ and $S_k$ are complete and therefore
(4.15) is also verified for decomposable H.
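
The sum handled in this induction can be computed by brute force on small graphs. The sketch below rests on assumptions that should be read as such: binary data, a graphical model so that $\mathcal{D}$ is the set of nonempty complete subsets, and $F_{C,H} = (-1)^{|C|-1}$ for complete $C$, which is what the alternating sum over the nonempty subsets of a complete set reduces to. The function names and the example graphs are hypothetical.

```python
from itertools import combinations

def is_complete(subset, edges):
    """True when every pair of vertices in subset is joined by an edge."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))

def sum_F(vertices, edge_list):
    """Sum of (-1)^(|C|-1) over the nonempty complete subsets C of the graph.

    Assumption: this equals the sum of F_{C,H} over C in D contained in H,
    for binary data and a graphical model on the vertex set H.
    """
    edges = {frozenset(e) for e in edge_list}
    total = 0
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_complete(subset, edges):
                total += (-1) ** (size - 1)
    return total

triangle = sum_F("abc", [("a", "b"), ("b", "c"), ("a", "c")])               # complete
chain = sum_F("abc", [("a", "b"), ("b", "c")])                              # decomposable, connected
four_cycle = sum_F("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])  # not decomposable
two_edges = sum_F("abcd", [("a", "b"), ("c", "d")])                         # two decomposable components
print(triangle, chain, four_cycle, two_edges)
```

On these four graphs the sum is 1 for the complete and the chain graphs, differs from 1 for the four-cycle, and counts the components for the disconnected graph, in line with the cases treated in the proof.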

Let us now prove that if H is not decomposable and connected, $\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H}$ cannot be equal
to 1. If H is not connected and its connected components $H^{(1)}, \ldots, H^{(l)}$, for some $l \ge 2$, are all
decomposable, we clearly have

$$\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H} = l.$$

If H is not connected and its components are not all decomposable, this implies that there is a
nondecomposable subset $F_1$ of G which can be separated from another subset $F_2$ of G but this
contradicts our assumption that G is a prime component of G. So, this case does not occur.
If H is not decomposable and connected, consider its set of cliques $\{C_1, \ldots, C_k\}$. Since H is not
decomposable, there is no perfect ordering of the cliques and therefore for any given ordering, there
exists a nonempty subset $Q \subseteq \{3, \ldots, k\}$ such that for $j \in Q$, there is no $i < j$ in the given ordering
of the cliques of H with $S_j = C_j \cap (\cup_{i=1}^{j-1} C_i) \subseteq C_i$ and therefore

$$S_j = C_j \cap (\cup_{i=1}^{j-1} C_i) = \oplus_{l=1}^{s_j} S_{j_l}, \quad 2 \le s_j \le j - 1,$$

where the $S_{j_l}$ can be chosen to be disjoint, with $S_{j_l} \subseteq C_j \cap C_m$ for some $m \in \{1, \ldots, j - 1\}$.

For $j \in \bar{Q} = \{2, \ldots, k\} \setminus Q$, there exists $i < j$ in the given ordering of the cliques of H such that
$S_j \subseteq C_i$. Therefore

$$\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H} = \sum_{C \subseteq_{\mathcal{D}} C_1} F_{C,H} + \sum_{j \in \bar{Q}} \Big(\sum_{C \subseteq_{\mathcal{D}} C_j} F_{C,H} - \sum_{C \subseteq_{\mathcal{D}} S_j} F_{C,H}\Big) \qquad (8.14)$$

$$\qquad + \sum_{j \in Q} \Big(\sum_{C \subseteq_{\mathcal{D}} C_j} F_{C,H} - \sum_{l=1}^{s_j} \sum_{C \subseteq_{\mathcal{D}} S_{j_l}} F_{C,H}\Big). \qquad (8.15)$$

The sums $\sum_{C \subseteq_{\mathcal{D}} U} F_{C,H}$, $U = C_1, C_j, S_j$, $j \in \bar{Q}$, are all equal to 1 since each of $C_1, C_j, S_j$, $j \in \bar{Q}$,
is complete and connected and therefore the right hand side of (8.14) is equal to 1. For the same
reason, on line (8.15), for $U = C_j, S_{j_l}$, $j \in Q$, $l = 1, \ldots, s_j$, $\sum_{C \subseteq_{\mathcal{D}} U} F_{C,H} = 1$. Since $s_j \ge 2$,

$$\sum_{C \subseteq_{\mathcal{D}} C_j} F_{C,H} - \sum_{l=1}^{s_j} \sum_{C \subseteq_{\mathcal{D}} S_{j_l}} F_{C,H} = 1 - s_j \le -1$$

and therefore the sum on line (8.15) is less than or equal to $-1$. It follows that

$$\sum_{C \subseteq_{\mathcal{D}} H} F_{C,H} \le 0$$

and in particular it cannot be equal to 1. The lemma is now proved.
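
8.3 Proof of Lemma 4.3

Here again, we will give the proof of the lemma for binary data and we will use the notation of §2.3.
To shorten notation, we will write $E \subseteq_{\mathcal{D}} F$ to indicate that $E \subseteq F$ and $E \in \mathcal{D}$.
It is more convenient to compute $|dp/d\theta|$, express it as a function of $\theta$ and take its inverse. From the
expression (2.21) of $p_D$, $D \in \mathcal{D}$, we have

$$\frac{\partial p_D}{\partial \theta_D} = p_D \Big(1 - \sum_{F \in \mathcal{E},\, F \supseteq D} p_F\Big), \qquad (8.16)$$

$$\frac{\partial p_D}{\partial \theta_C} = -p_D \sum_{F \in \mathcal{E},\, F \supseteq C} p_F, \quad C \in \mathcal{D},\ C \not\subseteq D, \qquad (8.17)$$

$$\frac{\partial p_D}{\partial \theta_C} = p_D \Big(1 - \sum_{F \in \mathcal{E},\, F \supseteq C} p_F\Big), \quad C \in \mathcal{D},\ C \subset D. \qquad (8.18)$$

We fix an arbitrary order of the elements of $\mathcal{D}$. From (8.16), (8.17) and (8.18), it follows that
the matrix of the Jacobian is such that the column of partial derivatives of $p_D$ is the vector with
C-component

$$p_D \Big(\delta_{\{F \subseteq D\}}(C) - \sum_{F \in \mathcal{E},\, F \supseteq C} p_F\Big), \quad C \in \mathcal{D},$$

where $\delta_{\{F \subseteq D\}}(C)$ is equal to 1 if $C \subseteq D$ and is equal to 0 otherwise. We note first that $p_D$ is
common to all components of the D column and therefore

$$J = \det A \prod_{D \in \mathcal{D}} p_D \qquad (8.19)$$

where A is the $|\mathcal{D}| \times |\mathcal{D}|$ matrix with entries

$$A_{C,D} = \delta_{\{F \subseteq D\}}(C) - \sum_{F \in \mathcal{E},\, F \supseteq C} p_F.$$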

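We note next that in the rows $r_C$ corresponding to $C \in \mathcal{D}$ maximal with respect to inclusion, the C
entry, on the matrix diagonal, is the only entry such that

$$\delta_{\{F \subseteq D\}}(C) = 1$$

and therefore, for C maximal, we can write

$$r_C = e_C - \Big(\sum_{F \in \mathcal{E},\, F \supseteq C} p_F\Big) 1^t \qquad (8.20)$$

where $e_C$ is the $|\mathcal{D}|$-dimensional row vector with components all equal to 0 except for the C component,
which is equal to 1, and $1^t$ is the $|\mathcal{D}|$-dimensional row vector with all its components equal to 1.
We finally note that if $C_1 \subseteq C_2$ for $C_1, C_2$ in $\mathcal{D}$, then

$$\{F \in \mathcal{E},\ F \supseteq C_2\} \subseteq \{F \in \mathcal{E},\ F \supseteq C_1\}$$

and therefore if, in the matrix A, for $C \in \mathcal{D}$ not maximal with respect to inclusion, we replace the
row $r_C$ by

$$\tilde{r}_C = \sum_{F \supseteq C,\, F \in \mathcal{D}} (-1)^{|F \setminus C|}\, r_F$$

we have

$$\tilde{r}_C = e_C - \Big(\sum_{F \supseteq C,\, F \in \mathcal{D}} (-1)^{|F \setminus C|} \sum_{H \in \mathcal{E},\, H \supseteq F} p_H\Big) 1^t. \qquad (8.21)$$

The determinant of A is clearly equal to the determinant of the matrix $\tilde{A}$ obtained from A by
replacing $r_C$ by $\tilde{r}_C$ whenever $C \in \mathcal{D}$ is not maximal with respect to inclusion. Using (8.20) and
(8.21), we have

$$\tilde{A} = I_{|\mathcal{D}|} - U 1^t$$

where U is the column vector $U = \big(\sum_{F \supseteq C,\, F \in \mathcal{D}} (-1)^{|F \setminus C|} (\sum_{H \in \mathcal{E},\, H \supseteq F} p_H),\ C \in \mathcal{D}\big)$. It is well-known
that

$$\det \tilde{A} = 1 - 1^t U.$$

Therefore

$$\det \tilde{A} = 1 - \sum_{C \in \mathcal{D}} \Big(\sum_{D \supseteq C,\, D \in \mathcal{D}} (-1)^{|D \setminus C|} \Big(\sum_{H \in \mathcal{E},\, H \supseteq D} p_H\Big)\Big)$$

$$\qquad = 1 - \sum_{H \in \mathcal{E}} p_H \Big(\sum_{\{D \subseteq H,\, D \in \mathcal{D}\}}\ \sum_{\{C \subseteq D,\, C \in \mathcal{D}\}} (-1)^{|D \setminus C|}\Big). \qquad (8.22)$$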

According to (4.13), the coefficients of $p_H$ in the expression above are the sum of the entries
$F_{D,H} = \sum_{\{C \subseteq_{\mathcal{D}} D\}} (-1)^{|D \setminus C|}$ in column H of F. Moreover, by Lemma 4.2 this sum $\sum_{D \subseteq H,\, D \in \mathcal{D}} F_{D,H}$
is equal to 1 if and only if $H \in \mathcal{E}$ is decomposable, connected and nonempty. Since
$1 = \sum_{F \in \mathcal{E}_0} p_F = \sum_{F \in U_0} p_F + \sum_{F \notin U_0} p_F$, we can write

$$\det \tilde{A} = \sum_{F \in U_0} p_F + \sum_{F \notin U_0} p_F - \sum_{H \in \mathcal{E}} p_H \sum_{\{D \subseteq H,\, D \in \mathcal{D}\}} F_{D,H}$$

$$\qquad = \sum_{F \in U_0} p_F + \sum_{F \notin U_0} p_F - \sum_{H \notin U_0} p_H - \sum_{H \in U_0} p_H \sum_{\{D \subseteq H,\, D \in \mathcal{D}\}} F_{D,H}$$

$$\qquad = - \sum_{H \in U_0} p_H \Big(\Big(\sum_{\{D \subseteq H,\, D \in \mathcal{D}\}} F_{D,H}\Big) - 1\Big)$$

$$\qquad = - \sum_{H \in U_0} a_H\, p_H. \qquad (8.23)$$

From (8.19) and (8.23), we derive the first expression for J in (4.18). The other expressions are
deduced by replacing the different $p_F$ by their expression with respect to $(\theta_D, D \in \mathcal{D})$.

Remark 8.1 When the model is not specified to be graphical but is more generally hierarchical, the
proof above holds up to (8.22).